Fire declared in OVH SBG2 datacentre building (ovh.net)
1141 points by finniananderson on March 10, 2021 | 589 comments



Back in the late 90s, I implemented the first systematic monitoring of WalMart Store's global network, including all of the store routers, hubs (not switches yet!) and 900 MHz access points. Did you know that WalMart had some stores in Indonesia? They did until 1998.

So when https://en.wikipedia.org/wiki/May_1998_riots_of_Indonesia started happening, we heard some harrowing stories of US employees being abducted, among other things.

Around that same time, the equipment in the Jakarta store started sending high temperature alerts prior to going offline. Our NOC wasn't able to reach anyone in the store.

The alerts were quite accurate: that was one of the many buildings that had been burned down in the riots. I guess it's slightly surprising that electrical power to the equipment in question lasted long enough to allow temperature alerting. Most of our stores back then used satellite for their permanent network connection, so it's possible telecom died prior to the fire reaching the UPC office.

In a couple of prominent places in the home office, there were large cutouts of all of the countries WalMart was in at the time up on the walls. A couple of weeks after this event, the Indonesia one was taken down over the weekend and the others re-arranged.


Thanks for sharing this interesting story. Part of my family immigrated from Indonesia due to those riots, but I was unaware up until today of the details covered by the Wikipedia article you linked.

I remember during the 2000s and 2010s that WalMart in the USA earned a reputation for its inventories primarily consisting of Chinese-made goods. I'm not sure if that reputation goes all the way back to 1998, but it makes me wonder if WalMart was especially targeted by the anti-Chinese element of the Indonesian riots because of it.


I can't recall (and probably didn't know at the time...it was far from my area) where products were sourced for the Indonesia stores.

Prior to the early 2000s, WalMart had a strong 'buy American' push. It was even in their advertising at the time, and literally written on the walls at the home office in Bentonville.

Realities changed, though, as whole classes of products were more frequently simply not available from the United States, and that policy and advertising approach were quietly dropped.

Just for the hell of it, I did a quick youtube search: "walmart buy american advertisement" and this came up: https://www.youtube.com/watch?v=XG-GqDeLfI4 "Buy American - Walmart Ad". Description says it's from the 1980s, and that looks about right.


In 1998 I had a friend who was supplying them with all sorts of hardware (hammers etc) from China.

The switch had already begun.


What the hell, here's another story. The summary to catch your attention: in the early 2000s, I first became aware of WalMart's full-scale switch to product sourcing from China by noting some very unusual automated network-to-site mappings.

Part of what my team (Network Management) did was write code and tools to automate all of the various things that needed to be done with networking gear. A big piece of that was automatically discovering the network. Prior to our auto discovery work, there was no good data source for or inventory of the routers, hubs, switches, cache engines, access points, load balancers, VOIP controllers...you name it.

On the surface, it seems scandalous that we didn't know what was on our own network, but in reality, short of comprehensive and accurate auto discovery, there was no way to keep track of everything, for a number of reasons.

First was the staggering scope: when I left the team, there were 180,000 network devices handling the traffic for tens of millions of end nodes across nearly 5,000 stores, hundreds of distribution centers and hundreds of home office sites/buildings in well over a dozen countries. The main US Home Office in Bentonville, Arkansas was responsible for managing all of this gear, even as many of the international home offices were responsible for buying and scheduling the installation of the same gear.

At any given time, there were a dozen store network equipment rollouts ongoing, where a 'rollout' is having people visit some large percentage of stores intending to make some kind of physical change: installing new hardware, removing old equipment, adding cards to existing gear, etc.

If store 1234 in Lexington, Kentucky (I remember because it was my favorite unofficial 'test' store :) was to get some new switches installed, we would probably not know what day or time the tech to do the work was going to arrive.

ANYway...all that adds up to thousands of people coming in and messing with our physical network, at all hours of the day and night, all over the world, constantly.

Robust and automated discovery of the network was a must, and my team implemented that. The raw network discovery tool was called Drake, named after this guy: https://en.wikipedia.org/wiki/Francis_Drake and the tool that used many automatic and manual rules and heuristics to map the discovered networking devices to logical sites (ie, Store 1234, US) was called Atlas, named after this guy: https://en.wikipedia.org/wiki/Atlas_(mythology)
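
To give a flavor of the approach (a hypothetical Python sketch, not the real Drake/Atlas, which were Perl; the hostname conventions and rules below are invented), the split was roughly a discovery pass that enumerates devices and a mapper that applies ordered rules to assign each one to a logical site:

    import re

    DISCOVERED = [
        # (hostname, mgmt_ip) as a stand-in for what the discovery walk returned
        ("st1234-rtr-01.example.net", "10.12.34.1"),
        ("st1234-sw-03.example.net", "10.12.34.13"),
        ("hoffice-shenzhen-sw-17.example.net", "10.200.5.17"),
    ]

    SITE_RULES = [
        # (regex on hostname, label builder) -- ordered, first match wins
        (re.compile(r"^st(\d{4})-"), lambda m: f"Store {m.group(1)}, US"),
        (re.compile(r"^hoffice-(\w+)-"), lambda m: f"International Home Office: {m.group(1)}"),
    ]

    def map_to_site(hostname):
        for pattern, to_label in SITE_RULES:
            m = pattern.match(hostname)
            if m:
                return to_label(m)
        return "UNMAPPED"  # unmapped devices would go to a human, who adds a new rule

    for host, ip in DISCOVERED:
        print(ip, host, "->", map_to_site(host))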

All of that background aside, the interesting story.

In the late 90s and early 2000s, Drake and Atlas were doing their thing, generally quite well and with only a fairly small amount of care and feeding required. I was snooping around and noticed that a particular site of type International Home Office had grown enormously over the course of a few years. When I looked, it had hundreds of network devices and tens of thousands of nodes. This was around 2001 or 2002, and at that time, I knew that only US Home Office sites should have that many devices, and thought it likely that Atlas had a 'leak'. That is, as Atlas did its recursive site mapping work, sometimes the recursion would expand much further than it should, and incorrectly map things.

After looking at the data, it all seemed fine. So I made some inquiries, and lo and behold, that particular international home office site had indeed been growing explosively.

The site's mapped name was completely unfamiliar to me, at the time at least. You might have heard of it: https://en.wikipedia.org/wiki/Shenzhen

I was seeing fingerprints in our network of WalMart's wholesale switch to sourcing from China.


In the early 2000s I was working as a field engineer installing/replacing/fixing network equipment for Walmart at all hours. It's pretty neat to hear the other side of the process! If I remember correctly there was some policy that would automatically turn off switch ports that found new, unrecognized devices active on the network for an extended period of time, which meant store managers complaining to me about voip phones that didn't function when moved or replaced.


Ah neat, so you were an NCR tech! (I peeked at your comment history a bit.) My team and broader department spent a lot of hours working with, sometimes not in the most friendly terms, people at different levels in the NCR organization.

You're correct: if Drake (the always-running discovery engine) didn't detect a device on a given port over a long enough time, then another program would shut that port down. This was nominally done for PCI compliance, but of course having open, unused ports, especially in the field, is just a terrible security hole in general.

In order to support legit equipment moves, we created a number of tools that the NOC and I believe Field Support could use to re-open ports as needed. I think we eventually made something that authorized in-store people could use too.

As an aside, a port being operationally 'up' wasn't by itself sufficient for us to mark the port as being legitimately used. We had to see traffic coming from it as well.
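
As a hedged illustration of that policy (not the actual tooling; the names, data shapes and grace period below are invented), the decision boils down to "only leave a port enabled if we have recently seen both a device and traffic on it":

    from datetime import datetime, timedelta

    GRACE = timedelta(days=30)   # invented threshold

    def should_disable(port, now):
        never_seen_device = port["last_device_seen"] is None
        stale_device = (not never_seen_device
                        and now - port["last_device_seen"] > GRACE)
        no_traffic = now - port["last_traffic_seen"] > GRACE
        # operationally 'up' is not enough; we also need to have seen traffic
        return never_seen_device or stale_device or no_traffic

    now = datetime.now()
    ports = [
        {"name": "Gi0/12", "last_device_seen": now - timedelta(days=2),
         "last_traffic_seen": now - timedelta(days=2)},
        {"name": "Gi0/13", "last_device_seen": None,
         "last_traffic_seen": now - timedelta(days=90)},
    ]
    for p in ports:
        if should_disable(p, now):
            print(f"would shut down {p['name']}")  # the real tool pushed a config change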

You mentioned elsewhere that you're working with a big, legacy Perl application, porting it to Python. 99% of the software my team at WalMart built was in Perl. (: I'd be curious to know, if you can share, what company/product you were working on.


The NCR/Walmart relationship was fairly strained during my tenure. Given the sheer number of stores/sites that Walmart had and NCR's own problems, it was not always possible to provide the quality of service people might expect, especially with a smile. From a FE perspective, working on networking gear at Walmart meant that you were out at 11pm at night (typically after working 10-11 hours already and spending an hour or two onsite waiting for the part to arrive via courier) and your primary concern was to get the job done and get back home. The worst was to plug a switch in, watch it not power up, and realize you'd need to be back in the same spot three or four hours later to try again.

Walmart must have been an interesting place to work during the late 90s, early 2000s - I imagine that most everywhere they had to solve problems at scale before scale was considered a thing. I'd be very interested to see how the solutions created in that period match to best-practices today, especially since outside of the telecom or perhaps defense worlds there probably wasn't much prior art.

As for the Perl application, I probably shouldn't say since I'm still employed at the same company and I know coworkers who read HN. If you're interested, DM me and I can at least provide the company name and some basic details.


> The NCR/Walmart relationship was fairly strained ...

Definitely. (: I didn't hold it against the hands on workers like yourself. Even (and perhaps especially) back then, WalMart was a challenging, difficult and aggressive partner.

> working on networking gear at Walmart meant that you were out at 11pm at night

That sounds about right; the scheduling I was directly aware of was very fast paced. Our Network Engineering Store Team pushed and pushed and pushed, just as they were pushed and pushed and pushed.

> Walmart must have been an interesting place to work during the late 90s, early 2000

Yup. Nowhere I've worked before or since had me learning as much or getting nearly as much done. It was an amazingly positive experience for me and my team, but not so positive for a lot of others.

> I imagine that most everywhere they had to solve problems at scale before scale was considered a thing.

Sometimes I imagine writing a book about this, because it's absolutely true, all over Information Systems Division.

For a time in the early 2000s, we were, on average, opening up a new store every day, and a typical new store would have two routers, two VOIP routers, two cache engines, between 10 and 20 switches, two or four wireless access point controllers and dozens of AP endpoints. That was managed by one or two people on the Network Engineering side, so my team (Network Management) wrote automation that generated the configs, validated connections, uploaded configs, etc etc etc. (Not one or two people per store: one or two people for ALL of the new stores.)
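
A minimal sketch of that kind of template-driven config generation (the addressing scheme, template text and device counts here are all invented; the real tooling was Perl and far more involved):

    # Hypothetical per-store config generation from a template.
    STORE_TEMPLATE = """\
    hostname st{store:04d}-rtr-{unit:02d}
    interface Vlan10
     description store {store} user vlan
     ip address 10.{octet_b}.{octet_c}.{unit} 255.255.255.0
    """

    def render_store_configs(store_number, router_count=2):
        # invented addressing scheme: derive octets from the store number
        octet_b = store_number // 256
        octet_c = store_number % 256
        return [
            STORE_TEMPLATE.format(store=store_number, unit=u,
                                  octet_b=octet_b, octet_c=octet_c)
            for u in range(1, router_count + 1)
        ]

    for cfg in render_store_configs(1234):
        print(cfg)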

The networking equipment was managed by a level of automation that is pretty close to what one sees inside of Google or Facebook today, and we were doing it 20 years ago.

> ... telecom ... prior art ...

John Chambers, the long-time CEO of Cisco, was at the time on WalMart's board of directors. He was always a bit of a tech head, and so when he came to Bentonville for board meetings, he'd often come and visit us in Network Engineering.

Around 2001-2002, we were chatting with him and he asked why we weren't using Cisco Works (now Cisco Prime: https://en.wikipedia.org/wiki/Cisco_Prime) to manage our network; back then it was mostly focused on network monitoring and, to a lesser extent, config management. We chuckled and told him that there was no way Cisco Works could scale to even a fraction of our network. He asked what we used, and of course we showed him the management system we'd written.

He was so impressed that he went back to San Jose, selected a group of Cisco Works architects, had them sign NDAs, and sent them to Bentonville, Arkansas for a month. The intent was to have them evaluate our software with an eye toward packaging it up and re-selling it.

Those meetings were interesting, but ultimately fruitless. The Cisco Works architects were Ivory Tower Java People. The first thing they wanted to see was our class hierarchy. We laughed and said we had scores of separate and very shallow classes, all written in Perl, C and C++.

Needless to say, they found the very 'rough and ready' way our platform was designed to be shocking and unpalatable. They went back and told Chambers that there was literally no way our products could be tied together.

> ... match to best-practices today ...

Professionally, I've been doing basically the same kinds of things since then, and I'll say that while our particular methods and approaches were extremely unusual, the high level results would meet or perhaps exceed what one gets with 'best practices' seen today.

Not because we were any smarter or better, but because we had no choice but to automate and automate effectively. At that scale, at that rate of change, at those uptime requirements, 'only' automating 99% would be disastrous.


> Sometimes I imagine writing a book about this

FWIW, my brain was going "book! book! book! book! book!" back at the top-level comment, and the beeper may have got slightly overloaded and broke as I continued reading. :)

Yes please.

As a sidenote, the story about "the CEO vs the architects" was very fascinating: the CEO could see the end-to-end real-world value of what you'd built, but the architects couldn't make everything align. In a sense the CEO was more flexible than the architects, despite the fact that stereotypes might suggest the opposite.

Also, the sentiment about your unusual methodology exceeding current best practice makes me wonder whether you achieved so-called "environmental enlightenment" - where everything clicks and just works and makes everyone who touches the system a 5x developer - or whether the environment simply had to just work really really well. Chances are the former is what everyone wishes they'll find one day, while the latter (incredibly complex upstream demands that are not going to go away anytime soon and which require you to simply _deliver_) definitely seems like the likelier explanation for why the system worked, regardless of the language it was written in - it was the product of a set of requirements that would not accept anything else.

Hmm. Now I think about that a bit and try and apply it to "but why is current best practice worse", I was musing the other day about how a lot of non-technical environments don't apply tech in smart ways to increase their efficiency, because their fundamental lack of understanding in technology means they go to a solutions provider, get told "this will cost $x,xxx,xxx", don't haggle because they basically _can't_, and of course don't implement the tech. I wonder if the ubiquitification (that seems to be a word) of so-called "best practices" in an area doesn't function in a similar way, where lack of general awareness/understanding/visibility in an area means methodology and "practices" (best or not) aren't bikeshed to death, and you can just innovate. (Hmm, but then I start wondering about how highly technically competent groups get overtaken by others... I think I'll stop now...)


Epic story! Thank you for sharing it. I appreciate the detail you included there.


You're quite welcome. In my experience, the right details often make a story far more interesting.


OVH monitoring is even better: everything is green on the destroyed DC http://status.ovh.com/vms/index_sbg2.html


That’s because it runs on the cloud… :-p


I love hearing stories from "old" Walmart. I was a Walmartian from 2017 to 2019, and I still miss my co-workers. (Shout-out to the Mobile Client team.)

Some interesting facts to know for those who don't dig into it. Walmart:

- has 80+ internal apps, mostly variants but still unique

- runs k8s inside of Distribution Centers

- maintains a fleet of >180k mobile devices in the US alone

- has a half-dozen data centers in the US

- has most International infrastructure separate from US Stores'

I've got some stories of my own, maybe I'll post them in a bit.


> has a half-dozen data centers in the US

Wow, that's a hell of a change! When I left in 2009, there were exactly two datacenters: NDC and EDC. Not surprising really.

From where I was sitting, the best era was definitely 1997-2004 or so. ISD really went downhill, pretty quickly, in my last five years there, for many different reasons.


Yeah, they opened up SDC, and added several "zones" as data centers as well.


Neat! Where did they build SDC?


The classic "lp0 on fire" error message comes to mind: https://en.wikipedia.org/wiki/Lp0_on_fire

Really though, I feel truly awful for anyone affected by this. The post recommends implementing a disaster recovery plan. The truth is that most people don't have one. So, let's use this post to talk about Disaster Recovery Plans!

Mine: I have 5 servers at OVH (not at SBG) and they all back up to Amazon S3 or Backblaze B2, and I also have a dedicated server (also OVH/Kimsufi) that gets the backups. I can redeploy in less than a day on fresh hardware, and that's good enough for my purposes. What's YOUR Disaster Recovery Plan?
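
For anyone building a similar plan, a minimal daily push to S3-compatible storage might look something like the sketch below (bucket name and paths are placeholders; it assumes boto3 and credentials are already configured, and the same approach works against Backblaze B2's S3-compatible endpoint):

    # Minimal sketch: tar a directory and push it to S3-compatible storage.
    import datetime
    import subprocess
    import boto3

    BUCKET = "my-offsite-backups"   # placeholder
    SRC = "/var/www"                # placeholder
    stamp = datetime.date.today().isoformat()
    archive = f"/tmp/backup-{stamp}.tar.gz"

    subprocess.run(["tar", "-czf", archive, SRC], check=True)
    boto3.client("s3").upload_file(archive, BUCKET, f"daily/{stamp}.tar.gz")
    print("uploaded", archive)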


I'm at OVH as well (in the BHS datacenter, fortunately). I run my entire production system on one beefy machine. The apps and database are replicated to a backup machine hosted with Hetzner (in their Germany datacenter). I also run a tiny VM at OVH which proxies all traffic to Hetzner. I use a failover IP to point at the big rig at OVH. If the main machine fails, I move the failover IP to the VM, which sends all traffic to Hetzner.

If OVH is totally down, and the failover IP doesn't work, I have a fairly low TTL on the DNS.

I backup the database state to S3 every day.

Since I'm truly paranoid, I have an Intel NUC at my house that also replicates the DB. I like knowing that I have a complete backup of my entire business within arm's reach.


Are you me, by any chance? :-)

I also run our entire production system on one beefy machine at OVH, and replicate to a similar machine at Hetzner. In case of a failure, we just change DNS, which has a 1 hour TTL. We've needed to do an unplanned fail-over only once in over 10 years.

And like you, I have an extra replica at the office, because it feels safe having a physical copy of the data literally at hand.


Same, but with a regular offline physical copy (cheap NAS). One of my worries is malicious destruction of the backups if anything worms its way into my network.


Which is why "off" is still a great security tool. A copy on a non-powered device, even if that device is attached to the network, is immune to worms. There is something to be said for a NAS solution that requires a physical act to turn on and perform an update.


Hetzner has storage boxes and auto snapshots. So even if someone deletes the backups remotely, there are still snapshots which they can't get unless they have control panel access.


My threat model is someone who would have full access to my computer without me knowing. So they could, over time, get access to passwords, modify my OS to MITM YubiKeys... Overly cautious, most likely, but it doesn't cost me much more.


> someone that would have full access to my computer without me knowing

What's most relevant, in your case, would you say? Evil maid attack, or browser zero day, or spear phishing? An insider? Or something else?

(And how did you arrive at that threat model, if I may ask)


Browser zero day or some kind of malicious Linux package that gets distributed, mostly. I don't think I have a profile that would make people bother to do physical attacks.


Ok. Me too, plus malicious software dependencies.


Not done any research into it, but I always thought OVH was supposed to be a very budget VPS service primarily for personal use rather than business. Although I thought it was akin to having a Raspberry Pi plugged in at home.

Again, I may be completely wrong but why would you not use AWS/GCP? Even if it's complexity, Amazon have Lightsail, or if it's cost I thought DigitalOcean was one of the only reputable business-grade VPS providers.

I just can't imagine many situations where a VPS would be superior to embracing the cloud and using cloud functions, containers, instances with autoscaling/load balancers etc.


You can't imagine it, yet a big chunk of the independent internet runs on small VPS servers. There isn't much difference between DO and OVH, Hetzner, Vultr, Linode... not sure why DO would be better. I mean, it's a US company doing marketing right. That's the difference. Plus OVH/Hetzner have only EU locations.

I think small businesses like smaller, simpler providers instead of big clouds. It's a different philosophy; if you are afraid of extreme centralisation of the internet, it makes sense.


I can think of a lot of big differences. For one you can get much larger machines at OVH and Hetzner with fancy storage configurations for your database if desired (e.g. Optane for your indices, magnetic drives for your transaction log, and RAIDed SSDs for the tables)

They also don't charge for bandwidth, although some of those other providers have a generous free bandwidth and cheap overage.


> Optane for your indices

At OVH? If so, their US data centers don't seem to have that option.

Not that I need it. The largest database I run could easily fit in RAM on a reasonably sized dedicated box.


I didn't realize they had US datacenters before now. It's possible that's no longer an option. It was on the largest servers in the Montreal datacenter when I specced that out.


they have 2 data centers in the US


So you are saying they might be even better than DO depending on requirements.

I didn't know.


Much cheaper and better performance at the high end. Doesn't compete at all at the low-end, except through their budget brand Kimsufi. I don't see them really as targeting the same market.


I rent a server from OVH for $32 a month. It's their So You Start line... doesn't come with fancy enterprise support and the like.

It's a 4 core 8 thread Xeon with 3x 1TB SATA with 32GB of ECC RAM IIRC (E3-SAT-1-32, got it during a sale with a price that is guaranteed as long as I keep renewing it)

The thing is great, I can run a bunch of VM's on it, it runs my websites and email.

Overall to get something comparable elsewhere I would be paying 3 to 4 times as much.

I would consider $50 a month or less low end pricing. ¯\_(ツ)_/¯


Yeah, I forgot they also have the so you start brand. It's probably more expensive than the majority of what digital ocean sells, but there is some overlap for sure.


I don't know about OVH but Hetzner beats DO at the lower end: for $5/month you get 2 CPUs vs 1, 2 GB RAM vs 1, 40 GB disk vs 25 and 20 TB traffic vs 1. They have an even lower-end package for 2.96 Euro/month as well.


OVH is not Europe-only, it has datacenters in America, Asia and Australia[1].

[1] https://www.ovh.com/world/us/about-us/datacenters.xml


OVH has at least one large North American datacenter in Beauharnois, located just south of Montreal. I've used them before for cheap dedicated servers. They may have others.


Yes, I didn't know and I was generalizing too much.

But I assume they are less known in the US.


OVH has a location in Canada, now.


AWS is a total and utter ripoff compared to the price/performance, DDoS protection & unmetered bandwidth provided by OVH.


If all you need is compute, storage, and a pipe, all the big cloud providers are a total ripoff and you should look elsewhere. The big ones only make sense if you are leveraging their managed features or if you need extreme elasticity with little chance of a problem scaling up in real time.

OVH is one of the better deals for bare metal, but there are even better ones for bandwidth. You have to shop around a lot.

Also be sure you have a recovery plan... even with the big providers. These days risks include not only physical stuff but some stupid bot shutting you off because it thinks you violated TOS or is reacting to a possibly malicious complaint.

We had a bot at AWS gank some test systems once because it thought we were cryptocurrency mining with free credits. We weren’t, but we were doing very CPU intensive testing. I’ve heard of this and worse happening elsewhere. DDOS detector and IDS bots are particularly notorious.


OVH is a European equivalent to Digital Ocean.

It has twice the revenue, and is the third largest hosting provider in the world.


I would call it the EU version of Rackspace. Except that it doesn't have Rackspace's insane prices.


> OVH is a European equivalent to Digital Ocean.

I've been making this point for a long time. Both of those AS spaces are legendary for the volume of dodgy packets they barf at the rest of us.

drop

drop

drop


Twice the revenue of DigitalOcean still puts it < $1B ARR, or am I missing something? I can’t see how that’s the third largest in the world, or does your definition of “hosting provider” exclude clouds?


I took it from the top of their Wikipedia page.

In any case, they aren't "primarily for personal use".

https://en.wikipedia.org/wiki/OVH


OVH is one of the largest providers in the world. They run a sub brand for personal use (bare metal for $5/m, hardware replacements in 30 min or less usually).

..and they do support all of those things you just listed, not just API-backed bare metal.


Is that a typo? I only see OVH bare metal starting at >$50. How could a provider offer a bare metal server for $5?


OVH has new servers.

Their sub-brand soyoustart has older servers (that are still perfectly fine), roughly E3 Xeon/16-32GB/3x2TB to 4x2TB for $40/m ex vat.

Their other sub brand kimsufi for personal servers has Atom low-power bare metal with a 2TB HDD (in reality it is advertised as 500GB/1TB, but they don't really have any of those in stock left; if your drive fails they replace it with a 2TB - so far this has been my experience) for $5.

All of this is powered by automation, you don't really get any support and you are expected to be competent. If your server is hacked you get PXE-rebooted into a rescue system and can scp/rsync off your contents before your server is reinstalled. OS installs, reboots, provisioning are all automated, there's essentially no human contact.

PS: Scaleway, in Paris, used to offer $2 bare metal (ultra low voltage, weaker than an Atom, 2GB ram), but pulled all their cheap machines, raised prices on existing users, and rebranded as enterprisey. The offer was called 'kidechire'

--

It is kind of interesting that on the US side everyone is in disbelief, or like "why not use AWS" - while most of the European market knows of OVH, Hetzner, etc.

My own reason for using OVH? It's affordable and I would not have gotten many projects (and the gaming community I help out with) off the ground otherwise. I can rent bare metal with NVMe, and several terabytes of RAM, for less than my daily wage for the whole month, and not worry about per-GB billing or attacks. In the gaming world you generally do not ever want to use usage-based billing - made the mistake of using Cloudfront and S3 once, and banned script kiddies would wget-loop the largest possible file from the most expensive region via botnet, repeatedly, in a money-DoS.

I legitimately wouldn't have been able to do my "for-fun-and-learning" side projects (no funding, no accelerator credits, ...) without someone like them. The equivalent of a digitalocean $1000/m VM is about $100 on OVH.


Just as a side note, the name "kimsufi" comes from the French "qui me suffit", which roughly translates to "enough for me".


Ha! You can see they localise it too https://www.ovh.com/world/support/terms-and-conditions/

Like the kimsufi-equivalent brand "isgenug"


I like scalingo too. If you need a bit more, they have DBaaS, APP Containers and Networking.


Scalingo is €552.96/m for 16GB of memory.

32c xeon/256GB ECC/500GB SSD 8TB HDD is $100/m at OVH. The difference is amusing.


you're comparing a PaaS with a piece of hardware. It's absolutely not comparable.

Yann, CEO at Scalingo


Kimsufi has tiny cheap Atom servers with self-built or made-to-order racks and hardware:

(This is 2011, I think it looks fancier now)

https://lafibre.info/ovh-datacenter/data-center-ovh-roubaix-...

Edit: Seems like they stopped publishing videos for that datacenter, but this seems to be a video of the burned-down datacenter in 2013: https://www.youtube.com/watch?v=Y47RM9zylFY


It's not a typo. OVH runs Kimsufi, which has bare metal servers for as low as 5$. It is pretty insane.


Thank you. TIL!


OVH STARTED as a budget VPS service some 20 years ago... but they have grown a lot in the last 6-7 years, adding more "cloud" services and capabilities, even if not on par with the main players...

Why not use AWS/GCP? From my personal point of view: as a French citizen, I'm more and more convinced that I can't completely trust the (US) big boys for my own safety. Trump showed that "US interest" is far more important than "customer interest" or even "ally interest". And moreover, Google is showing quite regularly that it's not a reliable business partner (AWS looks better for this).


Price; also, elaborate hardware customizations are not possible, and then you are still running on a hypervisor vs bare metal.


> Google is showing quite regurlaly that it's not a reliable business partner

Interesting, any examples?


I'm not the OP, but I'd imagine it is the combination of no support line and algorithmic suspension of business accounts. It is a relevant risk.


Yeah, I was thinking about all the horror stories that can be found on this site.

As a customer (or maybe an "involuntary data provider"), I do as much as I can to avoid Google being my SPOF, not technically (it's really technically reliable) but on the business side. I had to set up my own mail server just to avoid any risk of a Google ban, for example... just in case. I won't use Google Authenticator for the same reason. I'm happy to have left Google Photos some years ago, to avoid problems of Google shutting it down. And the list could go on...

As a business, I like to program Android apps, but the Google Store is really a risk too. The risk of having my Google account blacklisted because some algorithm thought I did something wrong. And no appeal.

Maybe all this doesn't apply to GCP customers. Maybe GCP customers have a direct human line, with someone to really help and the capacity to do it. Or maybe it's just Google: as long as it works, enjoy. If it doesn't, go to (algorithmic) hell.


Nope. I was at a company with a $1M dedicated spend contract w/ GCP and what that got us was support through a VAR. It then became the VAR’s job to file support tickets that took two weeks to get the response “oh well that’s not how we do it at Google. Have you read these docs you already said you read and can you send logs you already sent?” instead of my job to do that.


No, OVH has dedicated servers; lots of big companies use it to build out private clouds, much cheaper than Amazon or Google.


Are you truly paranoid?

If my money and/or job depended on having something running without (or with minimal) disruption I would be as paranoid as you, too.

BTW - Some people call this a business recovery plan, not plain paranoia ;-)


Enterprise-level projects often have only light protection against wrongful hosting account termination, reasoning that spending a lot of money and having an account manager keeps them safe from clumsy automated systems.

So they might have their primary and replica databases at different DCs from the same hosting provider, and only their nightly backup to a different provider. Four copies to four different providers is a step above three copies with two providers!

A large enterprise would probably be using a filesystem with periodic snapshots, or streaming their redo log to a backup, to protect against a fat-fingered DBA deleting the wrong thing. Of course, filesystem snapshots provide no protection against loss of DC or wrongful hosting account termination, so you might not count them as true backup copies.


This is why you should have a “Cloud 3-2-1” backup plan. Have 3 copies of your data, two with your primary provider, and 1 with another.

e.g., if you are an AWS customer, have your back ups in S3 and use simple replication to sync that to either GCS or Azure, where you can get the same level of compliance attestation as from AWS.
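
A hedged sketch of that second leg - copying S3 objects that are missing from a GCS bucket - assuming boto3 and google-cloud-storage are installed and both sets of credentials are configured; bucket names are placeholders:

    import boto3
    from google.cloud import storage

    S3_BUCKET = "primary-backups"      # placeholder
    GCS_BUCKET = "secondary-backups"   # placeholder

    s3 = boto3.client("s3")
    gcs = storage.Client()
    gcs_bucket = gcs.bucket(GCS_BUCKET)

    # keys already present in the secondary copy
    already_copied = {blob.name for blob in gcs.list_blobs(GCS_BUCKET)}

    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=S3_BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key in already_copied:
                continue
            # fine for modest backup objects; stream/chunk for very large ones
            body = s3.get_object(Bucket=S3_BUCKET, Key=key)["Body"].read()
            gcs_bucket.blob(key).upload_from_string(body)
            print("copied", key)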


It's not paranoia if you're right. All of the risks GP is protecting against are things that happen to someone every day, and they should be seen like wearing the seat belt in a car.


I have a reliability and risk avoidance mindset, but I’ve had to stand back because my mental gas tank for trying to keep things going is near empty.

I’ve really struggled working with others that either are both ignorant and apathetic about the business’s ability to deal with risk or believe that it’s their job to keep putting duct tape over the duct tape that breaks multiple times a day while users struggle.

I like seeing these comments reminding others to a wear seat belt or have backups for their backups, but I don’t know whether I should care more about reliability. I work in an environment that’s a constant figurative fire.

I also like to spend time with my family. I know it’s just a job, and it would be even if I were the only one responsible for it; that doesn’t negate the importance of reliability, but there is a balance.

If you are dedicated to reliability, don’t let this deter you. Some have a full gas tank, which is great.


Consider:

> ... [F]inance is fundamentally about moving money and risk through a network. [1]

Your employer has taken on many, many risks as part of their enterprise. If every risk is addressed the company likely can’t operate profitably. In this context, your business needs to identify every risk, weigh the likelihood and the potential impact, decide whether to address or accept the risk, and finally, if they decide to address the risk, whether to address it in-house or outsource it.

You’ve identified a risk that is currently being “accepted” by your employer, one that you’d like to address in-house. Perhaps they’ve taken on the risk unintentionally, out of ignorance.

As a professional the best I can do is to make sure that the business isn’t ignorant about the risk they’ve taken on. If the risk is too great I might even leave. Beyond that I accept that life is full of risks.

[1] Gary Gensler, “Blockchain and money”, Introduction https://ocw.mit.edu/courses/sloan-school-of-management/15-s1...


This resonates with me. I notice my gas tank rarely depletes because of technology. It doesn’t matter how brain-dead the 00’s Oracle Forms app with the absurd unsupported EDI submission Excel thinga-ma-bob that requires a modem ... <fill in the rest of the dumpster fire as your imagination deems>. Making a tech stack safe is a fun challenge.

Apathetic people though, that can be really tough going. It’s just that way “because”. Or my favourite: “oh we don’t have permission to change that” - how about we make the case and get permission? _horrified looks_ sometimes followed by pitchforks.


Reliability is there to keep your things running smoothly during normal operations. Backups are there for when you reach the end of your reliability rope. Neither is really a good replacement for the other. The most reliable systems will still fail eventually, and the best of backups can't run your day to day operations.

At the end of the day you have a budget (of any kind) and a list of priorities on which to spend it. It's up to you or your management to set a reasonable budget, and to set the right priorities. If they refuse, leave, or you'll just burn the candle at both ends and fade out.


Backups are a reliability tool, yes.

A backup on its own is of little worth if unused.

When a backup is used to re-enable something, the amount of downtime may be decreased. When it is, that is reliability - we keep things usable and functioning more often than not.


Bacula has some really cool features for cloud backups.

https://bacula.org


Ow what an unfortunate name.

https://en.wikipedia.org/wiki/Baculum


It just means "stick" in the original Latin!


but fortunately also within arm's reach


LGTM


It seems your setup follows the 3-2-1 backup rule, with at least two copies in different physical locations. https://www.nakivo.com/blog/3-2-1-backup-rule-efficient-data...


Are your domains at OVH too? If yes, I'd consider changing that: this morning the manager was quite flooded and the DNS service was down for some time...


Yet another cheer for Hetzner!


this is the way. I do the same. Not paranoid at all.


> I like knowing that I have a complete backup of my entire business within arm's reach.

It could also provide a burglar a fantastic opportunity to pivot into a career in data breaches.


For small firms, CEO / CTO maintaining off-sites at a residence is reasonable and not an uncommon practice. As with all security / risk mitigation practices, there is a balance of risks and costs involved.

And as noted, encrypted backups would be resistant to casual interdiction, or even strongly-motivated attempts. Data loss is the principal risk mitigated by off-site, on-hand backups.


This problem is usually solved through encryption.
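
As a minimal illustration (not a production tool: it reads the whole file into memory, key handling is left to you, and paths are placeholders), encrypting a backup before it ever leaves the building could look like this, using the Python cryptography library's Fernet recipe:

    from cryptography.fernet import Fernet

    # Generate once and keep it somewhere safe, NOT next to the backups:
    #   key = Fernet.generate_key()
    key = open("/root/backup.key", "rb").read()           # placeholder path

    f = Fernet(key)
    plaintext = open("/tmp/backup.tar.gz", "rb").read()   # placeholder path
    with open("/tmp/backup.tar.gz.enc", "wb") as out:
        out.write(f.encrypt(plaintext))
    # decrypt later with f.decrypt(ciphertext)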


If I were to ask my CISO if I was allowed to bring the production database home, I’m pretty sure his answer wouldn’t be “as long as you encrypt it”.


That's because he doesn't trust you with this data. That has nothing to do with encryption safety.

There is nothing magical about data centers making them safe while your local copy isn't.


> There is nothing magical about data centers making them safe while your local copy isn't.

Is this a serious comment? My house is not certified as being compliant with any security standards. Here's the list that the 3rd party datacenter we use is certified as compliant with:

https://aws.amazon.com/compliance/programs/

The data centers we operate ourselves are audited against several of those standards too. I guess you're right that there's nothing magic about security controls, but it has nothing to do with trust. Sensitive data should generally never leave a secure facility, outside of particularly controlled circumstances.


Of course it's serious.

You are entirely missing the point by quoting the compliance programs followed by AWS, whose sole business is being a third-party host.

For most business, what you call sensitive data is customers and orders listing, payment history, inventory if you are dealing in physical goods and HR related files. These are not state secrets. Encryption and a modicum of physical security go a long way.

I personally find the idea that you shouldn't store a local backup of this kind of data out of security concern entirely laughable. But that's me.


This is quite a significant revision to your previous statement that there’s nothing about a data center that makes it more secure than your house.

This attitude that your data isn’t very important, so it’s fine to not be very concerned about its security, while not entirely uncommon, is something most organisations try to avoid when choosing vendors. It’s something consumers are generally unconcerned about, until a breach occurs and The Intercept writes an article about it. At which point I’m sure all the people ITT who are saying it’s fine to take your production database home would be piling on with how stupid the company was for doing ridiculous things like taking a copy of their production database home.


> This is quite a significant revision to your previous statement that there’s nothing about a data center that makes it more secure than your house.

I said there was nothing magical about data centers security, a point I stand with.

It's all about proper storage (encryption) and physical security. Obviously, the physical security of an AWS data center will be tighter than your typical SME's, but in a way which is of no significance to storing backups.

> This attitude that your data isn’t very important

You are once again missing the point.

It's not that your data isn't important. It's that storing it encrypted in a sensible place (and to be clear by that I just mean not lying around - a drawer in an office or your server room seems perfectly adequate to me) is secure enough.

The benefits of having easily available backups by far trump the utterly far fetched idea that someone might break into your office to steal your encrypted backups.


> It's that storing it encrypted in a sensible place (and to be clear by that I just mean not lying around - a drawer in an office or your server room seems perfectly adequate to me) is secure enough.

In the SME space some things are "different", and if you've not worked there it can be hard to get one's head around it:

A client of mine was burgled some years ago.

Typical small business, offices on an industrial estate with no residential housing anywhere nearby. Busy in the daytime, quiet as the grave during the night. The attackers came in the wee small hours, broke through the front door (the locks held, the door frame didn't), which must have made quite a bit of noise. The alarm system was faulty and didn't go off (later determined to be a 3rd party alarm installer error...)

All internal doors were unlocked, PCs and laptops were all in plain sight, servers in the "comms room" - that wasn't locked either.

The attacker(s) made a cursory search at every desk, and the only thing that was taken at all was a light commercial vehicle which was parked at the side of the property, its keys had been kept in the top drawer of one of the desks.

The guy who looked after the vehicle - and who'd lost "his" ride - was extremely cross, everyone else (from the MD on downwards) felt like they'd dodged a bullet.

Physical security duly got budget thrown at it - stable doors and horses, the way the world usually turns.


But how many of those splashy breaches ended up being because of the off-site backup copy of the database at the CEO's house?


Once you're big enough to afford a CISO, you're likely big enough to afford office space with decent physical security to serve as a third replicated database site to complement your two datacenters.

These solutions are not one-size-fits-all. What works for a small startup isn't appropriate for a 100+ person company.


Yes, I agree. Small companies typically are very bad at security.


Not in my experience. Worked at some small shops that were lightyears ahead in terms of policy, procedures and attitude compared to places I've worked with 50k+ employees globally.


Large organisations tend not to achieve security compliance with overly sophisticated systems of policy and controls. They tend to do it using bureaucracy, which while usually rather effective at implementing the level of control required, will typically leave a lot to be desired in regards to UX and productivity. Small organisations tend to ignore the topic entirely until they encounter a prospective client or regulatory barrier that demands it. At which point they may initially implement some highly elegant systems. Until they grow large enough that they all devolve into bureaucratic mazes.


I'm aware, but that's not been my experience. I've been in large places where there's been a laissez-faire attitude because it was "another team's job", and general bikeshedding over smaller features because the bigger-picture security wasn't their area, or it was forced by a diktat from above to use X because they're on the board, whilst X is completely unfit for purpose. There's no pushback. However I've worked at small ISPs where we took security extremely seriously. Appropriate background checks and industry policy, but more so the attitude... we wanted to offer customers security because we had pride in our work.


I am really good at security and I too keep encrypted backups on-site in my house.


Well, it's not because the encryption is insecure.


It all depends on your paranoia level of data hacking burglars vs. vaporized data centers.


LUKS is your friend.


But it does protect somewhat against ransomware on the servers.


If you are a corporate entity of some kind, the final layer of your plan should always be "Go bankrupt". You can't successfully recover from every possible disaster and you shouldn't try to. In the event of a sufficiently unlikely event, your business fails and every penny spent attempting the impossible will be wasted, move on and let professional administrators salvage what they can for your creditors.

Lots of people plan for specific elements they can imagine and forget other equally or even more important things they are going to need in a disaster. Check out how many organisations that doubtless have 24/7 IT support in case a web server goes down somehow had no plan for what happens if it's unsafe for their 500 call centre employees to sit in tiny cubicles answering phones all day even though pandemic respiratory viruses are so famously likely that Gates listed them consistently as the #1 threat.


"Go bankrupt" is not a plan. Becoming insolvent might be the end result of a situation but it's not going to help you deal with it.

Let's take an example which might lead to bankruptcy. A typical answer to a major disaster (let's say your main and sole building burning, as a typical case) for an SME would be to cease activity, furlough employees and stop or defer every payment you can while you claim insurance and assess your options. Well, none of these things are obvious to do, especially if all your archives and documents just burnt. If you think about it (which you should), you will quickly realise that you at least need a way to contact all your employees, your bank and your counsel (which would most likely be the accountant certifying your results rather than a lawyer if you are an SME in my country) offsite. That's the heart of disaster planning: having solutions at the ready for what was easy to foresee so you can better focus on what wasn't.


> "Go bankrupt" is not a plan.

Yes it is. (Though it's better, as GP suggested, as a final layer of a plan and not the only layer.)

> Becoming insolvent might be the end result of a situation but it's not going to help you deal with it.

Insolvency isn't bankruptcy. Becoming insolvent is a consequence, sure. Bankruptcy absolutely does help you deal with that impact, that's rather the point of it.


Hence the "final layer" statement.

Bankruptcy when dealt with correctly is a process not an end.

If everything else fails it's better to file for bankruptcy when there is still something to recover with the help of others than to burn everything to ashes because of your vanity.

At least that's how I understood parent's comment.


As a quick interlude, since this may be confusing to non-US readers: bankruptcy in the United States in the context of business usually refers to two concepts, whereas in many other countries it refers to just one.

There are two types of bankruptcies in the US used most often by insolvent businesses: Chapter 7, and Chapter 11.

A Chapter 7 bankruptcy is what most people in other countries think of when they hear "bankruptcy" - it's the total dissolution of a business and liquidation of its assets to satisfy its creditors. A business does not survive a Chapter 7. This is often referred to as a "bankruptcy" or "liquidation" in other countries.

A Chapter 11 bankruptcy, on the other hand, is a process by which a business is given court protection from its creditors and allowed to restructure. If the creditors are satisfied with the reorganisation plan (which may include agreeing to change the terms of outstanding debts), the business emerges from Chapter 11 protection and is allowed to continue operating. Otherwise, if an agreement can't be reached, the business may end up in Chapter 7 and get liquidated. Most countries have an equivalent to a Chapter 11, but the name for it varies widely. For example, Canada calls it a "Division 1 Proposal," Australia and the UK call it "administration," and Ireland calls it "examinership."

Since there's a lot of international visitors to HN I just thought I'd jump in and provide a bit of clarity so we can all ensure we're using the same definition of "bankruptcy." A US Chapter 7 bankruptcy is not a plan, it's the game over state. A US Chapter 11 bankruptcy, on the other hand, can definitely be a strategic maneuver when you're in serious trouble, so it can be part of the plan (hopefully far down the list).


This helps a lot, thanks. I think most people internationally would assume bankruptcy = game over.


Yes, I for one, was confused.

I wondered why you should plan an event which will "destroy" your company anyways.


> Bankruptcy when dealt with correctly is a process not an end.

Yes, that's why "Go bankrupt" is not a plan which was the entire point of my reply. That's like saying that your disaster recovery plan is "solve the disaster".


Going bankrupt is a plan. However, it is a somewhat more involved one than it sounds, at first. That's why there should be a corporate lawyer advising on stuff like company structure, liabilities, continuance of pension plans, ordering and reasons for layoffs, etc.


It's not quite that simple, the data you might have may be needed for compliance or regulatory reasons. Having no backup strategy might make you personally liable depending on the country!


IMHO, the part they had no plan for was being unable to just require their employees to come in anyway...


The more insecure your workers, the easier it is to get them to come in, regardless of what the supposed rules may or may not be.

Fast Fashion for example often employs workers in more or less sweatshop conditions close to the customers (this makes commercial sense, if you make the hot new items in Bangladesh you either need to expensively air freight them to customers or they're going to take weeks to arrive after they're first ordered - there's a reason it isn't called "Slow fashion"). These jobs are poorly paid, many workers have dubious right-to-work status, weak local language skills, may even be paid in cash - and so if you tell them they must come in, none of them are going to say "No".

In fact the slackening off in R for the area where my sister lives (today the towering chimneys and cavernous brick factories are just for tourists, your new dress was made in an anonymous single story building on an industrial estate) might be driven more by people not needing to own new frocks every week when they've been no further than their kitchen in a month than because it would actually be illegal to staff their business - if nobody's buying what you make then suddenly it makes sense to take a handout from the government and actually shut rather than pretend making mauve turtleneck sweaters or whatever is "essential".


For non-UK residents: "frock" is a regional term for a dress - quite common in the Midlands.


Just to clarify: transatlantic shipments take a week port-to-port, e.g. Newark, NJ, USA to Antwerp, Belgium. (Bangladesh to Italy via the Suez Canal looks like a 2-week voyage, or 3 weeks to the US west coast. Especially the latter would probably have quite a few stops on the way along the Asian coast.) You get better economics than shipping via air freight from one full pallet and up. Overland truck transport to and from the port is still cheaper than air freight, at least in the US and central Europe.

For these major routes, there are typically at least bi-weekly voyages scheduled, so for this kind of distance, you can expect about 11 days pretty uniformly distributed +-2 days, if you pay to get on the next ship.

This may lead to (committing to) paying for the spot on the ship when your pallet is ready for pickup at the factory (not when it arrives at the port) and using low-delay overland trucking services. Which operate e.g. in lockstep with the port processing to get your pallet on the move within half a day of the container being unloaded from the ship, ideally having containers pre-sorted at the origin to match truck routes at the destination. So they can go on a trailer directly from the ship and rotate drivers on the delivery tour, spending only a few minutes at each drop-off.

Because those can't rely on customers to be there and get you unloaded in less than 5 minutes, they need locations they can unload at with on-board equipment. They'd notify the customer with a GPS-based ETA display, so the customer can be ready and immediately move the delivery inside. Rely on 360-degree "dashcam" coverage and encourage the customer to have the drop-off point under video surveillance, just to easily handle potential disputes. Have the delivery person use some suitable high-res camera with a built-in light to get some full-surface-coverage photographic evidence of the condition it was delivered in.

I'd guess with a hydraulic lift on the trailer's back and some kind of folding manual pallet jack stuck on that (fold-up) lift, so they drive up to the location, unlock the pallet jack, unfold the lift, lower the lift almost to the ground, detach the pallet jack to drop it the last inch/few cm to the ground, pull the jack out, lower the lift the rest of the way, drive it on to the lift, open the container, get up with the pallet jack, drive the pallets (one-by-one) for this drop-off out of the container and leave them on the ground, close and lock the container, re-arm the jack's hooks, shove the jack back under the slightly-lowered folding lift, make it hook back in, fold it up, lock the hooking mechanism (against theft at a rest stop (short meal and toilet breaks exist, but showering can be delayed for up to 2 nights)), fold it all the way up, and go on to drive to their next drop-off point.


Just lobby the government to put call centers in "essential services". In my state they are open even with a partial lockdown.


The final layer is call the insurance company.


Not really, the insurance won't make things right in an instant. They will usually compensate you financially, but often only after painstaking evaluation of all circumstances, weighing their chances in court to get out of paying you and maybe a lengthy court battle and a race against your bankruptcy.

So yes, getting insurance can be a good idea to offset some losses you may have, as long as they are somewhat limited compared to your company's overall assets and income. But as soon as the insurance payout matches a significant part of your net worth, the insurance might not save you.


Fair enough.


There are always uninsurable events and for large enough companies/risks there are also liquidity limits to the size of coverage you can get from the market even for insurable events.

As such, it makes sense to make the level of risk you plan to accept (by not being insured against it and not mitigating) a conscious economic decision rather than pretending you've covered everything.


As long as you have no outside shareholders you can decide that. If you do, you'd be surprised at how they respond to an attitude like that. After all: you can decide the levels of risk that you personally are comfortable with leading to extinguishing of the business, but a typical shareholder is looking at you to protect their investment, and not insuring against a known risk which at some point in time materializes is an excellent way to find yourself in the crosshairs of a minority shareholder lawsuit against a (former) company executive.


In my work life I am a professional investor, so I've been through the debate on insure/prepare or not many times. It's always an economic debate when you get into "very expensive" territory (cheap and easy is different obviously).

The big example of this which springs to mind is business interruption cover - it's ruinously expensive so it's extremely unusual to have the max cover the market might be prepared to offer. It's a pure economic decision.


Yes, but it is an informed decision and typically taken at the board level; very few CEOs who are not 100% owners would be comfortable with the decision to leave an existential risk uncovered without full approval of all those involved, which is kind of logical.

Usually you'd have to show your homework (offers from insurance companies proving that it really is unaffordable). I totally get the trade-off, and the fact that if the business could not exist while properly insured, plenty of companies will simply take their chances.

We also both know that in case something like that does go wrong everybody will be looking for a scapegoat, so for the CEO's own protection it is quite important to play such things by the book, on the off chance the risk one day does materialize.


Absolutely - but that's kind of my point. You should make the decision consciously. The corporate governance that goes around that is the company making that decision consciously.


And this is the heart of the problem: a lot of the time these decisions are made by people who shouldn't be making them, or they aren't made at all; they are just made by default, without bringing the fact that a decision is required to the level of scrutiny normally associated with such decisions.

This has killed quite a few otherwise very viable companies. It is fine to take risks as long as you do so consciously and with the full approval of all stakeholders (or at least a majority of them). Interesting effects can result: a smaller investor may demand indemnification, then one by one the others also want that indemnification, and ultimately the decision is made that the risk is unacceptable anyway (I've seen this play out); another variation is that one shareholder ends up being bought out because they have a different risk appetite than the others.


It's true: most companies do not have a disaster recovery plan, and many of them confuse a breach protocol with a disaster recovery plan ('we have backups').

Fires in DCs aren't rare at all; I know of at least three, one of those in a building where I had servers. This one seems to be worse than the other two. Datacenters concentrate a lot of flammable stuff, throw a ton of current through it and do so 24x7. The risk of a fire is definitely not imaginary, which is why most DCs have fire suppression mechanisms. Whether those work as advertised depends on the nature of the fire. An exploding on-prem transformer took out a good chunk of EV1's datacenter in the early 2000s, and it wasn't so much the fire that caused problems for their customers, but the fact that someone got injured (or even died, I don't recall exactly), and it took a long time before the investigation was completed and the DC was released back to the owners.

Being paranoid and having off-site backups is what allowed us to be back online before the fire was out. If not for that I don't know if our company would have survived.




And here's some more from firefighters, while it was burning:

https://twitter.com/xgarreau/status/1369559995491172354

Looks glowing red to me.


Nobody hurt. That's a bit of good news.


is that entirely cargo containers? is that common for a data center?


No, SBG2 was a building in the "tower design", as is SBG3 behind it. The containers in the foreground are SBG1, from the time when OVH didn't know whether Strasbourg was going to be a permanent thing.


https://en.wikipedia.org/wiki/Modular_data_center

Container DCs were a big thing for a while. Even Google did a whole PR thing about how they used them.


Funnily enough, I think it was the fire risk that caused them to ditch the idea and move to their current design. Though I know modular design is highly likely to be used by all players as edge nodes spring up worldwide.


It was also that the container had literally no advantages. It was just a meme that did not survive rational analysis. The building in which the datacenter is located is the simplest, cheapest part of the design. Dividing it up into a bunch of inconveniently-sized rectangles solves nothing.


Uff... it looks like half of the containers in this picture were on fire...


Are you making a joke about docker containers or am I missing something?


Part of OVH datacenters are literal, physical containers with racks, power supply, vents, etc.

You can see more details here: https://baxtel.com/data-center/ovh-strasbourg-campus


Do the containers have fire suppressant systems installed?


This looks like yet another aluminium composite panel fire...


Got burned once (no pun intended), learned my lesson.

Hot spare on a different continent with replicated data along with a third box just for backups. The backup box gets offsite backups held in a safe with another redundant copy in another site in another safe.

Restores are tested quarterly.

Keep backups of backups. Once bitten, twice shy.


> Restores are tested quarterly.

Probably this is the most important part of your plan. It's not the backup that matters; it's the restore. And if you don't practice it from time to time, it's probably not going to work when you need it.


> Hot spare on a different continent

Just be cautious about data locality laws (not likely to affect you as joe average, more for businesses)


A few years ago I worked on the British Telecom Worldwide intranet team, and we had a matrix mapping various countries' encryption laws.

This was so we remained legal in all of the countries BT worked in, which required a lot of behind-the-scenes work to make sure we didn't serve "illegally encrypted" data.


yeah, there's lots of countries with regulations that certain data can't leave the geographical boundary of the country. Often, it is the most sensitive data.


These laws generally don't work how people think they do.

For example, the Russian data residency law states that a copy of the data must be stored domestically, not that it can't be replicated outside the country.

The UAE has poorly written laws that have different regulations for different types of data - including fun stuff like only being subject to specific requirements if the data enters a 270 acre business park in Dubai.

Don't even get me started on storing encrypted data in one country and the keys in another...


Have you been bitten, personally? If so, story time?


Also, stupid things not to forget: make sure your DNS provider is independent, otherwise you won't be able to point to your new server (or have a secondary DNS provider). Make sure any email required for 2FA, or for communicating with the hosting service managing your infrastructure, isn't running on that same infrastructure.


I am literally in SBG2 so that has been fun.

Turns out, our disaster recovery plan is pretty good.

Datacenter burned down and I still was up 4 hours later in another data center with zero data loss. Good times.


And how's SBG1 doing?


About 1/3 destroyed


We test rolling over the entire stack to another AWS DR region (just one we don't normally use) from S3 backups, etc. We do this annually and try to introduce some variations to the scenarios. It takes us about 18 hours realistically.

Documentation / SOPs that have been tested thoroughly by various team members are really important. It helps work out any kinks in interpretation, syntax errors etc.

It does feel a little ridiculous at the time for all the effort involved, but incidents like this show why it's so important.


At work, there are several layers.

As an immediate plan, the 2-3 business-critical systems replicate their primary storage to systems in a different datacenter. This allows us to kick off the configuration management in a disaster, and we need somewhere between 1 and 4 hours to set up the necessary application servers and middleware to get critical production running again.

Regarding backups, backups are archived daily to 2 different borg repo hosts on different cloud providers. We could lose an entire hoster to shenanigans and the damage would be limited to ~2 days of data loss at worst. Later this year, we're also considering exporting some of these archives to our sister team, so they can place a monthly or weekly backup on tape in a safe in order to have a proper offline backup.

Regarding restores - there are daily automated restore tests for our prod databases, which are then used for a bunch of other tests after anonymization. On top, we've built most database handling on top of the backup/restore infra in order to force us to test these restores during normal business processes.

As I keep saying, installing a database is not hard. Making backups also isn't hard. Ensuring you can restore backups, and ensuring you are not losing backups almost regardless of what happens... that's hard and expensive.
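
For illustration, a daily borg run of the kind described above might look roughly like this; the repo URLs, paths and retention numbers are assumptions, not our actual setup:

    # Push today's archive to two independent repo hosts.
    export BORG_PASSCOMMAND='cat /root/.borg-pass'
    for repo in ssh://borg@backup1.example.com/./prod ssh://borg@backup2.example.net/./prod; do
        borg create --stats --compression zstd \
            "$repo::{hostname}-{now:%Y-%m-%d}" /srv/app /etc /var/backups/db
        # Thin out old archives per repo: 7 daily, 4 weekly, 6 monthly.
        borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 "$repo"
    done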



In my case:

* All my services are dockerized and have gitlab pipelines to deploy on a kubernetes cluster (RKE/K3s/baremetal-k8s)

* git repo's containing the build scripts/pipelines are replicated on my gitlab instance and multiple work computers (laptop & desktop)

* Data and databases are regularly dumped and stored in S3 and my home server

* Most of the infrastructure setup (AWS/DO/Azure, installing kubernetes) is in Terraform git repositories. And a bit of Ansible for some older projects.

Because of the above, if anything happens all I need to restore a service is a fresh blank VM/dedicated machine or a cloud account with a hosted Kubernetes offering. From there it's just configuring terraform/ansible variables with the new hosts and executing the scripts.


How often do you test starting from a clean slate?


I have a similar setup: I recreate everything for every major kubernetes update.


My disaster recovery plan: we shall rebuild.


Finally we get to rewrite everything from scratch!


The spice will flow, and the tech debt will go!


F


One of my backup servers used to be in the same datacenter as the primary server. I only recently moved it to a different host. It's still in the same city, though, so I'm considering other options. I'm not a big fan of just-make-a-tarball-of-everything-and-upload-it-to-the-cloud backup methodology, I prefer something a bit more incremental. But with Backblaze B2 being so cheap, I might as well just upload tarballs to B2. As long as I have the data, the servers can be redeployed in a couple of hours at most.

The SBG fire illustrates the importance of geographical redundancy. Just because the datacenters have different numbers at the end doesn't mean that they won't fail at the same time. Apart from a large fire or power outage, there are lots of things that can take out several datacenters in close vicinity at the same time, such as hurricanes and earthquakes.


> I'm not a big fan of just-make-a-tarball-of-everything-and-upload-it-to-the-cloud backup methodology, I prefer something a bit more incremental.

pretty much a textbook use-case for zfs with some kind of snapshot-rolling utility. Snap every hour, send backups once a day, prune your backups according to some timetable. Transfer as incrementals against the previous stored snapshot. Plus you get great data integrity checking on top of that.
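
A rough sketch of that cycle, with made-up pool, host and snapshot names:

    # Hourly, from cron: cheap point-in-time snapshot.
    zfs snapshot tank/data@auto-$(date +%Y%m%d-%H%M)

    # Daily: send only the delta since the snapshot the backup box already holds.
    zfs send -i tank/data@auto-20210309-2300 tank/data@auto-20210310-2300 \
        | ssh backup.example.com zfs receive -F backup/data

    # Prune local snapshots that have aged out of the retention timetable.
    zfs destroy tank/data@auto-20210303-2300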

"but linus said..."


>"but linus said..."

Yes, I still don't understand him, as he calls himself a "filesystem guy". Also I don't understand why no one ever mentions NILFS2.


with all due respect here - I've never heard of it either, and that's not what you want with a filesystem.

The draw of ZFS is that it's the log-structured filesystem with 10 zillionty hours of production experience that says that it works. And that's why BTRFS is not a direct substitute either. Or Hammer2. There are lots of things that could be cool, the question is are you willing to run them in production.

There is a first-mover advantage in filesystems (that occupy a given design and provide a given set of capabilities). At some point a winner sucks most of the oxygen out of the atmosphere here. There is maybe space for a second place winner (btrfs), there isn't a spot for a fourth-place winner.


You can do the same with Ext4


Ext4 has no snapshot feature, do you mean with lvm?


Yes LVM, sorry
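
For anyone curious, the LVM route for an ext4 volume looks roughly like this; the volume group, sizes and paths are made up:

    # Allocate a temporary copy-on-write snapshot of the live volume.
    lvcreate --snapshot --size 10G --name data-snap /dev/vg0/data
    mkdir -p /mnt/data-snap
    mount -o ro /dev/vg0/data-snap /mnt/data-snap

    # Back up from the frozen view, then throw the snapshot away.
    tar -czf /backup/data-$(date +%F).tar.gz -C /mnt/data-snap .
    umount /mnt/data-snap
    lvremove -f /dev/vg0/data-snap

Note that the snapshot needs enough reserved space in the volume group to absorb writes made while the backup runs.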


Another option for incremental backups is Restic [0]. It has support to backup to Backblaze B2, Amazon S3 and lots of other places.

[0] https://restic.net/


I’ve taken to uploading via rsync or similar entire copies - as tarballs use the whole bandwidth each time but rsync on files brings only the changes.
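
Something along these lines; the paths and host are placeholders:

    # Only changed files cross the wire; --delete mirrors removals too.
    rsync -az --delete /srv/data/ backup@offsite.example.com:/backups/myserver/data/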


I use tarballs because it allows me to not trust the backup servers. ssh is set up such that the backup servers' keys are certified to run only a single backup script, which just returns the encrypted data, and nothing else.

It's very easy to use spare storage in various places to do backups this way, as ssh, gpg and cron are everywhere, and you don't need to install any complicated backup solutions or trust the backup storage machines much.

All you have to manage centrally is private keys for backup encryption, and CA for signing the ssh keys + some occasional monitoring/tests.
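
The CA-signed key setup is hard to show briefly, but the plain authorized_keys flavour of the same idea looks roughly like this; key material, paths and host names are made up:

    # On the machine being backed up - the backup server's key may only
    # trigger this one pipeline (single authorized_keys line, key elided):
    command="tar -czf - /srv/data /etc | gpg --encrypt -r backups@example.com",no-pty,no-port-forwarding,no-agent-forwarding ssh-ed25519 AAAA... backup-puller

    # On the (untrusted) backup server, cron just collects ciphertext:
    ssh backup@app.example.com > /backups/app-$(date +%F).tar.gz.gpg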


One up for rclone, it's parallel and supports many endpoints.


Can't you add only changes to a tar?


Indeed it does; see the --listed-incremental and --incremental options:

https://www.gnu.org/software/tar/manual/tar.html#Incremental...
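
For the curious, a minimal GNU tar incremental cycle looks like this (paths are illustrative):

    # Level 0: full backup, recording file state in the snapshot file.
    tar --listed-incremental=/var/backups/data.snar -czf full-0.tar.gz /srv/data

    # Later runs with the same snapshot file only pick up changes.
    tar --listed-incremental=/var/backups/data.snar -czf incr-1.tar.gz /srv/data

    # Restore in order; /dev/null is the documented way to enable
    # incremental handling on extract without a real snapshot file.
    tar --listed-incremental=/dev/null -xzf full-0.tar.gz -C /restore
    tar --listed-incremental=/dev/null -xzf incr-1.tar.gz -C /restore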


Duplicity is your best bet for incremental backups using B2. I use this for my personal server and it works brilliantly.


I thought so too for a long while. Until I was trying to restore something (just to test things), and wasn’t able to... it might have been specific to our GPG or an older version or something... but I decided to switch to restic and am much happier now.

Restic has a single binary that takes care of everything. It feels more modern and seems to work really well. Never had any issue restoring from it.

Just one data point. Stick to whatever works for you. But important to test not only your backups, but also restores!
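
A minimal restic-to-B2 cycle, assuming made-up bucket and path names, looks something like:

    export B2_ACCOUNT_ID=...            # application key ID
    export B2_ACCOUNT_KEY=...           # application key
    export RESTIC_REPOSITORY=b2:my-backup-bucket:hosts/web1
    export RESTIC_PASSWORD_FILE=/root/.restic-pass

    restic init                                   # once, to create the repo
    restic backup /srv/data /etc                  # deduplicated, incremental
    restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune
    restic check                                  # verify repository integrity
    restic restore latest --target /restore-test  # the part worth rehearsing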


I've been using Duplicati forever. The fact that it's C# is a bit of a pain (some distros don't have recent Mono), but running it in Docker is easy enough. Being able to check the status of backups and restore files from a web UI is a huge plus, so is the ability to run the same app on all platforms.

I've found duplicity to be a little simplistic and brittle. Purging old backups is also difficult, you basically have to make a full backup (i.e. non-incremental) before you can do that, which increases bandwidth and storage cost.

Restic looks great feature-wise, but still feels like the low-level component you'd use to build a backup system, not a backup system in itself. It's also pre-1.0.


Interesting, I will check Restic out, I’ve heard other good things about it. Duplicity is a bit of a pain to set up and Restic’s single binary model is more straightforward (Go is a miracle). Thanks for the recommendation!

GPG is a bit quirky but I do regularly check my backups and restores (if once every few months counts as regular).


+1 for Restic

It's brilliant; it has worked like a charm on FreeBSD, Windows and an RPi with Linux for over 2 years.


I'm using rclone, it works very well for the purpose too.


Ditto. Moved to rclone after having a bunch of random small issues with Duplicity that on their own weren't major but made me lose faith in something that's going to be largely operating unsupervised except for a monthly check-in.


I’d stay away from duplicity. I’ve had serious problems with it and large inode counts where it’ll bang the CPU at 100% and never complete.

Have moved to using rdiff-backup over SSH.


Self-hosted Kubernetes and a FreeNAS storage system at home, and a couple of VMs in the cloud. I've got a mixed strategy, but it covers everything to remote locations.

I use S3 API compatible object storage platforms for remote backup. E.g. BackBlaze B2. I wrote about my backup scripts for FreeNAS (jail that runs s3cmd to copy files to B2) here: https://www.shogan.co.uk/cloud-2/cheap-s3-cloud-backup-with-...

For Kubernetes I use Velero, which can be configured with an S3 storage backend target: https://www.shogan.co.uk/kubernetes/kubernetes-backup-on-ras...


Personal: I run a webserver for some website (wordpress + xenforo), I've set up a cronjob that creates a backup of /var/www, /etc and a mysql database dump, then uploads it to an S3 bucket (with automatic Glacier archiving after X period set up). It should be fairly straightforward to rent a new server and set things back up. I still dislike having to set up a webserver + php manually though, I don't get why that hasn't been streamlined yet.
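
The cron job in question is nothing fancy; roughly this shape (bucket, paths and database name are placeholders):

    #!/bin/sh
    # Runs from /etc/cron.daily; lifecycle rules on the bucket move
    # objects to Glacier after the configured number of days.
    set -eu
    ts=$(date +%F)
    tmp=$(mktemp -d)

    mysqldump --single-transaction mysite | gzip > "$tmp/db-$ts.sql.gz"
    tar -czf "$tmp/files-$ts.tar.gz" /var/www /etc

    aws s3 cp "$tmp/db-$ts.sql.gz"    "s3://mysite-backups/$ts/"
    aws s3 cp "$tmp/files-$ts.tar.gz" "s3://mysite-backups/$ts/"
    rm -rf "$tmp"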

My employer has a single rack of servers at HQ. It's positioned at a very specific angle with an AC unit facing it, their exact positions are marked out on the floor in tape. The servers contain VMs that most employees work on, our git repository, issue trackers, and probably customer admin as well. They say they do off-site backups, but honestly, when (not if) that thing goes it'll be a pretty serious impact on the business. They don't like people keeping their code on their take-home laptop either (I can't fathom how my colleagues work and how they can stand working in a large codebase using barebones vim over ssh), but I've employed some professional disobedience there.


Have you considered writing an ansible playbook to set all that up? You could even have it pull down the backup and do a full restore for you...


Basically the same (offsite backups), but the details are in the what and how, which is subjective... For my purposes I decided that offsite backups should only comprise user data, and that all server configuration be 100% scripted, with some interactive parts to speed up any customization, including recovering backups. I also have my own backup servers rather than using a service, and implement immutable incremental backups with rotated ZFS snapshots (this is way simpler than it sounds). I can highly recommend ZFS as an extremely reliable incremental backup solution, but you must enable block-level deduplication and expect it to gobble up all the server RAM to be effective (that's why I dedicate a server to it and don't need masses of cheap slow storage)... also, the backup server itself is restorable by script and only relies on having at least one of the mirrored block devices intact, which I make a local copy of occasionally.

I'm not sure how normal this strategy is outside of container land but I like just using scripts, they are simple and transparent - if you take time and care to write them well.


This sounds like what I want to do for the new infrastructure I'm setting up in one of OVH's US-based data centers. Are you running on virtual machines or bare metal? What kind of scripting or config management are you using?


VPS although there is no dependency on VPS manager stuff so I don't see any issue with running on bare metal. No config managers, just bash scripts.

They basically install and configure packages using sed or heredocs with a few user prompts here and there for setting up domains etc.

If you are constantly tweaking stuff this might not suit you, but if you know what you need and only occasionally do light changes (which you must ensure the scripts reflect) then this could be an option for you.

It does take some care to write reliable clear bash scripts, and there are some critical choices like `set -e` so that you can walk away and have it hit the end and know that it didn't just error in the middle without you noticing.
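
The skeleton of such a script, with placeholder packages and config edits:

    #!/usr/bin/env bash
    set -euo pipefail   # stop at the first error, unset variable or failed pipe stage

    read -rp "Primary domain: " domain

    apt-get update
    apt-get install -y nginx mariadb-server

    # Patch the stock config in place rather than templating whole files.
    sed -i "s/server_name _;/server_name ${domain};/" /etc/nginx/sites-available/default

    systemctl enable --now nginx mariadb
    echo "Done provisioning ${domain}"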


Servers are at a mix of "cloud" providers, and on-site. Most data (including system configs!) is backed up on-site nightly, and to B2 nightly with historical copies - and critical data is also live-replicated to our international branches. (Some "meh" data is backed up only to B2, like our phone logs; we can get most of the info from our carrier anyway).

Our goal and the reason we have a lot of stuff backed up on-prem is to have our most time-critical operations back up within a couple of hours - unless the building is destroyed, in which case that's a moot point and we'll take what we can get.

A dev wiped our almost-monolithic sales/manufacturing/billing/etc MySQL database a month or two ago. (I have been repeatedly overruled on the topic of taking access to prod away from devs) We were down for around an hour. Most of that time was spent pulling gigs of data out of the binlog without also wiping it all again. Because our nightly backups had failed a couple weeks prior - after our most recent monthly "glance at it".


Less than a day for disaster recovery on fresh hardware? Same as my case. As you say, good enough for most purposes, but I'm also looking for improvement. I have offsite realtime replicas for data and MariaDBs, and offsite nightly backups (a combo of rsnapshot, lsyncd, MariaDB multi-source replication, and a post-new-install script that sets up almost everything in case you have to recover on bare metal, i.e. no available VM snapshots).

Currently trying to reduce that "less than a day" though. Recently discovered "ReaR" (Relax and Recover) from RedHat, and it sounds really nice for bare-metal servers. Not everybody runs virtualized/in the cloud (being able to recover from VM snapshots is really a plus). Let's share experiences :)


We have two servers at OVH (RBX and GRA, not SBG). I make backups of all containers and VMs every day and keep the last three, plus one for each month. Backups are stored on a separate OVH storage disk and also downloaded to a NAS on-premise. In case of a disaster, we'd have to rent a new server, reprovision the VMs and containers and restore the backups. About two days of work to make sure everything works fine, and we could lose about 24 hours of data.

It's not the best in terms of Disaster Recovery Plan but we accept that level of risk.


Nothing too crazy, just a simple daily cron to sync user data and database dumps on our OVH boxes to Backblaze and rsync.net. This simple setup has already saved our asses a few times.


Most people/companies don't have the money to set up those disaster plans. They require you to have a similar server ready to go and also a backup solution like Amazon S3.

I was affected, my personal VPS is safe but down and other VPS I was managing I don't know anything about. I have the backups and right now I'd love for them to just set me up a new VPS so I can restore the backups and restore the services.


Spin it up now and refuse to pay for the old one later.


Personally my email hosting is down, but thankfully my web hosting and Nextcloud instance were both at GRA2 (Gravelines).

But i have a friend who potentially lost important uni work hosted on his nextcloud instance... On SBG2.

A rough reminder that backups are really important, even if you are just an individual


I only have a personal server running in Hetzner but it's mirrored onto a tiny local computer at home.

They both run postfix + dovecot, so mail is synced via dovecot replication. Data is rsync-ed daily, and everything has ZFS snapshots. MySQL is not set up for replication - my home internet breaks often enough to cause serious issues - so instead, every day I drop everything and import a full dump from the main server, and do a local dump as a backup on both sides.

I don't have automatic failover set up.


Not saying that you should never do a full mysql dump. Nor that you should not ensure that you can import a full dump.

But when you already use ZFS you can do a very speedy full backup with:

    mysql << EOF
        -- Quiesce writes so the snapshot captures a consistent on-disk state
        FLUSH TABLES WITH READ LOCK;
        -- "system" runs a shell command from inside the mysql client,
        -- so the read lock is still held while the snapshot is taken
        system zfs snapshot data/db@snapname
        UNLOCK TABLES;
    EOF
Transfer the snapshot off-site (and test!). Either as a simple filecopy (the snapshot ensured a consistent database) or a little more advanced with zfs send/receive. This is much quicker and more painless than mysql dump. Especially with sizeable databases.
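
A minimal sketch of that off-site step, with made-up host, pool and snapshot names:

    # First time: full send of the snapshot taken above.
    zfs send data/db@snapname | ssh backup.example.org zfs receive -u tank/db-copy

    # Thereafter: only the delta between the previous and the newest snapshot.
    zfs send -i data/db@snapname data/db@snapname2 \
        | ssh backup.example.org zfs receive -u tank/db-copy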


Do you even need to flush the tables and grab a read lock while taking the ZFS snapshot? My understanding was that since ZFS snapshots are point-in-time consistent, taking a snapshot without flushing tables or grabbing a read lock would be safe; restoring from that snapshot would be like rebooting after losing power.


I think you are correct. But then you risk data loss just as you would with an unclean shutdown. I much prefer to have a known clean state, which all things considered should be a safer bet. Just like some are OK running without fsync.


Good point, but my DB is tiny, so for now, I can afford the mysqldump. But I'll keep this in mind.


I don't have to "backup servers" for a long time now. I have an Ansible playbook to deploy and orchestrate services, which, in turn, are mostly dockerized. So my recovery plan is to turn on "sorry, maintenance" banner via CDN, spin up a bunch of new VPSes, run Ansible scenario for deployment and restore database from hidden replica or latest dump.


> restore database from hidden replica or latest dump

You do have backup servers.


He said he doesn't have to backup servers, not that he doesn't have backup servers.


My recovery plan: tarball & upload to Object Store. I'm going to check out exactly how much replication the OVH object store offers, and see about adding a second geographic location, and maybe even a second provider, tomorrow.

(My servers aren't in SBG either - phew!)


If your primary data is on OVH, I'd look at using another company's object store if feasible (S3, B2, etc). If possible, on another payment method. (If you want to be really paranoid, something issued under another legal entity.)

There's a whole class of (mostly non-technical) risks that you solve for when you do this.

If anything happens with your payment method (fails and you don't notice in time; all accounts frozen for investigation), OVH account (hacked, suspended), OVH itself (sudden bankruptcy?), etc, then at least you have _one_ other copy. It's not stuff that's likely to happen, but the cost of planning for it at least as far as "haven't completely lost all my data even if it's going to be a pain to restore" here is relatively minimal.


Snapshots, db backups and data backups.

Rolling backups with a month retention to box using rsync.

It creates a network drive to box by default when I boot my desktop.

I have some scripts for putting production DBs into test and for when I want them locally.


Does anyone have experience with lvarchive or FSArchiver (or similar) to backup images of live systems instead of file based backup solutions?


I have three servers (1 OVH - different location, 2 DO). The only thing I backup is the DB, which is synced daily to S3. There's a rule to automatically delete files after 30 days to handle GDPR and stop the bucket and costs spiralling out of control.

Everything is managed with Ansible and Terraform (on DO side), so I could probably get everything back up and running in less than an hour if needed.


> probably

That makes it sound like you didn't try/practice. I imagine that in a real-life scenario things will be a little more painful than in one's imagination.


Exactly. Having a plan is only part of it. Good disaster plans do dry runs a couple of times a year (the time change is always a convenient reminder). If you rehearse the recovery when you're not panicked, you have a better chance of not skipping a step when the timing is much more crucial. Also, some sort of guide with the steps given procedurally is a great idea.


I don't think this is necessarily true for all parts of a disaster plan. Some mechanisms may be untestable because it is unknown how to actually trigger it (think certain runtime assertions, but on a larger scale).

Even if it is possible to trigger and test, actually using the recovery mechanism may have a high cost, either monetarily or in losing some small amount of data. These mechanisms should almost always be an additional layer of defense and only be invoked in case of true catastrophe.

In both cases, the mechanisms should be tested as thoroughly as possible, either through artificial environments that can simulate improbable scenarios or, in the latter case, on a small test environment to minimize cost.


I haven't ever deleted everything and timed how long it would take to get it up and running again, but I have tested that it works by spinning up new machines and moving everything over to them (it was easier than running "sudo apt-get dist-upgrade").


Here's what i do for my homelab setup that has a few machines running locally and some VPSes "in the cloud":

I personally have almost all of the software running in containers with an orchestrator on top (Docker Swarm in my case, others may also use Nomad, Kubernetes or something else). That way, rescheduling services on different nodes becomes less of a hassle in case of any one of them failing, since i know what should be running and what configuration i expect it to have, as well as what data needs to be persisted.

At the moment i'm using Time4VPS ( affiliate link: https://www.time4vps.com/?affid=5294 ) for the stuff that needs decent availability and because they're cheaper than almost all of the alternatives i've looked at (DigitalOcean, Vultr, Scaleway, AWS, Azure) and that matters to me.

Now, in case the entire data centre disappears, all of my data would still be available on a few HDDs under my desk (which are then replicated to other HDDs with rsync locally), given that i use BackupPC for incremental scheduled backups with rsync: https://backuppc.github.io/backuppc/

For simplicity, the containers also use bind mounts, so all of the data is readable directly from the file system, for example, under /docker (not really following some of the *nix file system layout practices, but this works for me because it's really easy to tell where the data that i want is).

I actually had to migrate over to a new node a while back, took around 30 minutes in total (updating DNS records included). Ansible can also really help with configuring new nodes. I'm not saying that my setup would work for most people or even anything past startups, but it seems sufficient for my homelab/VPS needs.

My conclusions:

  - containers are pretty useful for reproducing software across servers
  - knowing exactly which data you want to preserve (such as /var/lib/postgresql/data/pgdata) is also pretty useful, even though a lot of software doesn't really play nicely with the idea
  - backups and incremental backups are pretty doable even without relying on a particular platform's offerings, BackupPC is more than competent and buying HDDs is far more cost effective than renting that space
  - automatic failover (both DNS and moving the data to a new node) seems complicated, as does using distributed file systems; those are probably useful but far beyond what i actually want to spend time on in my homelab
  - you should still check your backups


Do you have access to your control panel when the servers are down?


> What's YOUR Disaster Recovery Plan?

Prayer and hope, usually.


Never thought i'd read the header

> Printer flammability


A lot of prayers....


A status update on the OVH tracker for a different datacenter (LIM-1 / Limburg) says "We are going to intervene in the rack to replace a large number of power supply cables that could have an insulation defect." [0][1] The same type of issue is "planned" in BHS [3] and GRA [2].

Eerie timing: do they possibly suspect some bad cables?

[0]: http://travaux.ovh.net/?do=details&id=49016

[1]: http://travaux.ovh.net/?do=details&id=49017

[3]: http://travaux.ovh.net/?do=details&id=49462

[2]: http://travaux.ovh.net/?do=details&id=49465


>Eerie timing: do they possibly suspect some bad cables?

Why not? Cables rated lower than the load they are carrying are a prime cause of electrical fires. If the load is too high for long enough, the insulation melts away, and if other material is close enough to catch fire then that's the ball game. It's a common cause of home electrical fires: some lamp with poor wiring catches the drapes on fire, etc. Wouldn't think a data center would have flammable curtains though.


This definitely could be a probable cause. When I was a teen I witnessed such a fire once. Basically a friend had a heater but couldn't find any mains cable, and eventually decided to disconnect the mains cable from the radio and use that. After a few minutes the insulation on the mains cable melted away, the cable turned glowing red and it started burning the table it was on. Fortunately we were not asleep and got it under control quickly. Lesson learned.


We have several bare-metal servers in GRA/Gravelines and RBX/Roubaix; 3 weeks ago we had 3 hours of downtime in RBX because they were replacing power cords without prior notification. Maybe they were aware this could happen and were in the process of fixing it.


https://www.google.com/search?q=site%3Ahttp%3A%2F%2Ftravaux.... there's quite a few of these

http://travaux.ovh.net/?do=details&id=47840 earliest one that I found was back in December


They're waiting an awful long time to do the one at BHS-7 if so: 14 days from now?


I had mine restarted this morning in BHS. Only one of 3 servers...


Some pictures of the building with firefighters at work

https://www.dna.fr/faits-divers-justice/2021/03/10/strasbour...

Edit:

Video at https://www.youtube.com/watch?v=a9jL_THG58U

Satellite view of the site on Google Maps https://goo.gl/maps/L2T6YNFCtiyDdiNv7


That is a disturbingly large amount of flame. I was expecting a datacenter to not have much in the way of flammable material.

...then I read in some other comments here that they used wood in the interior construction.


Wood is a reasonable construction material.

It takes a good while to start burning, and even when significantly charred it still retains most of its strength.


Wood is a reasonable construction material for my house, or an office - but is it for a building with that much energy kept that close together?


Treated lumber is generally considered fairly fireproof (comparable to steel or concrete, though with different precise failure modes). It depends on exactly what kind of wood is being used. A treated 12x12 beam is very fire resistant; plywood is less so.

The issue is you'll have lots of plastic (cabling) in a DC, and plastic will burn


There is self-extinguishing cable insulation, though. I'm actually surprised this (DC flammability) is still an issue, and not already solved by making components self-extinguishing and banning non-tiny batteries inside the equipment. If you want to have a battery for your RAID controller, put something next to it that will stop your system from burning down its surroundings.


Easy to see now that a lightly constructed five-story cube might not be fully fireproof.


Would you humor us with a link to a fully fire proof datacenter?


Newly constructed datacenters in the US tend to be all metal with a full building clean suppression agent. https://www.fike.com/products/ecaro-25-clean-agent-fire-supp...

I used to work for a provider whose 2 main datacenters of 8k+ sq ft could pull all oxygen out of the building in 60 seconds.


Data centres I used to work in back in the early 2000s had argonite gas dumps in place (prior to argonite, halon used to be popular but is an ozone depleting gas so was phased out)

In the case of a fire, it would dump a lot of argonite gas in, displacing a large amount of the oxygen in the room and starving the fire. It's also safe and leaves minimal clean-up work afterwards, and doesn't harm electronics etc., unlike sprinklers and the like.

The amount of oxygen left is sufficient for human life, but not for fires, though my understanding is that it can be quite unpleasant when it happens. You won't want to hang around.


One of ours had a giant red button you could hold to pause the 60 second timer before all the oxygen was displaced. Every single engineer was trained to immediately push that if people were in the room because it was supposedly a guaranteed death if you got stuck inside once the system went off.


Well, yeah, these normal inert gas fire suppression systems don't do a good job if humans can still breathe. The Novec 1230 based ones can actually be sufficiently effective for the typical flammability properties you can cheaply adhere to in a datacenter, but even then, IIRC, you would want to add both that and some extra oxygen, because the nitrogen in the air is much more effective at suffocating humans than at suffocating fire. This stuff is just a really, really heavy gas that's liquid below about body temperature (it boils easily though), and the heat capacity of gases is mostly proportional to their density.

Flames are extinguished by this cooling effect (identical to water in that regard), but humans rely on catalytic processes that aren't affected by the cooling effect.

If you could keep the existing oxygen inside, while adding Novec 1230, humans could continue to breathe while the flames would still be extinguished, but this would require the building/room to be a pressure chamber that holds about half an atmosphere of extra pressure. I'm pretty sure just blowing in some extra oxygen with the Novec 1230 would be far cheaper to do safely and reliably.

I mean, in principle, if you gave re-breathers to the workers and have some airlocks, you could afford to keep that atmosphere permanently, but it'd have to be a bit warm (~30 C I'd guess). Don't worry, the air would be breathable, but long-term it'd probably be unhealthy to breathe in such high concentrations and humans breathing would slightly pollute the atmosphere (CO2 can't stay if it's supposed to remain breathable).

Just to be clear: in an effective argonite extinguishing system you'd have about a minute or two until you pass out and need to be dragged out, ideally get oxygen, get ventilated (no brain, no breathing) and potentially also be resuscitated (the heart stops shortly after your brain from a lack of oxygen, so if you're ventilated fast enough, it never stops and you wake up a few externally-forced breaths later). Having an oxygen bottle to supplement your breaths would fix that problem for as long as it's not empty.


> I mean, in principle, if you gave re-breathers to the workers and have some airlocks, you could afford to keep that atmosphere permanently,

At this point I feel like it would be cheaper just to not have workers go there. Fill the place completely full of nitrogen with an onsite nitrogen generator (and only 1atm pressure). Have 100% of regular maintenance and as much irregular maintenance as possible be done by robots. If something happens that requires strength/dexterity beyond the robots (e.g. a heavy object falling over), either have humans go in in some form of scuba gear, or if you can work around it just don't fix it.


That seems reasonable. But just to clarify what I meant with airlock: some thick plastic bag with a floor-to-ceiling zipper on the "inner" and "outer" ends, and for entry, it's first collapsed by a pump sucking the air out of it. Then you open the zipper on the outer end, step in, close the zipper, let the pump suck away the air around you, and open the inner zipper (they should probably be automatically operated, as you can't move well/much when you are "vacuum bagged").

For exit, basically just the reverse, with the pump pumping the air around the person to wherever the person came from.

The general issue with unbreathable atmospheres is that a failure in their SCBA gear easily kills them.

And re-breathers that are only there so you don't have to scrub the atmosphere in the room as often shouldn't be particularly expensive. You may even get away with just putting a CO2 scrubber on the exhaust path, and giving them slightly oxygen-enriched bottled air so you can keep e.g. a 50:50 oxygen:nitrogen ratio inside (so e.g. 20% O2, 20% N2, 60% Novec 1230). And it doesn't even need to be particularly effective, as you can breathe in quite a bit of the ambient air without being harmed, and the environment can tolerate some of your CO2. Like, as long as it scrubs half of your exhausted CO2 it won't even feel stuffy in there (you could handle the ambient air you'd have without re-breathers being used, as it'd be just 1.6% CO2, but you'd almost immediately get a headache).

They'd have an exhaust vent for pressure equalization, which would chill the air to condense and re-cycle the Novec 1230. For pressure equalization in the other direction, they'd probably just boil off some of that recycled Novec 1230.

So yeah, re-breather not needed, if you just get a mouth+nose mask to breathe bottled 50:50 oxygen:nitrogen mix. That 50% oxygen limit (actually 500 mBar) is due to long-term toxicity, btw. Prolonged exposure to higher levels causes lung scarring and myopia/retina detachment, so not really fun.


I think Microsoft had some experimental datacenter containers they submerged in the northern Atlantic for passive cooling, and I believe those were filled with an inert gas as well. I guess that would come very close to an actual fireproof datacenter.


Yep you’re right, learning about this was part of some onboarding training we had to complete...

it was an interesting proof of concept, but finding the right people to maintain the infra with both IT and Scuba skills was a narrow niche to nail down ;)


I don't think opening it at any point before decommissioning it completely is even an afterthought with that. They just write off any failures and roll with it as long as it's viable.


Yeah that was one of the show-stopping issues, inability to repair / hotswap etc...


Well, you only need to get the shared infrastructure reliable enough that you can afford not to design it with repair in mind. The cloud servers are already designed without unit-level maintenance work in mind, which saves money by eliminating rails and similar. They get populated racks from the factory and just cart them from the dock to their place, plug them in (maybe run some self-test to make sure there are no loose connectors or so), and eventually decommission them after some years.


From the outside (and from a position lacking all inside knowledge) it looks highly interconnected and very well ventilated. I'm not sure where you'd put an inert gas suppression system or beefy firewalls to slow the fire's progress.


How do you see that?


Your Google Maps link seems to be off by a bit.


Cloud going up in smoke.


According to the official status page, the whole datacenter is still green http://status.ovh.com/vms/index_sbg2.html


I feel like there should be a place to report infrastructure suppliers with misleading status pages, some kind of crowdsourced database. Without this information, you only find out that they are misleading when something goes very wrong.

At best you might be missing out on some SLA refunds, but at worst it could be disastrous for a business. I've been on the wrong side of an update-by-hand status system from a hosting provider before and it wasn't fun.


Who monitors the monitors?

Agreed, though. A fake status page is worse than no status page. I don't mind if the status page states that it's manually updated every few hours as long as it's honest. But don't make it look like it's automated when it's not.



Wtf is this disclaimer on Down Detector for? (Navigate to OVH page.). It sits in front of user comments, I think:

> Unable to display this content to due missing consent. By law, we are required to ask your consent to show the content that is normally displayed here.


It's a Disqus widget. If you denied consent for third party tracking, they can't load it.

I've usually seen this with embedded videos rather than comments.


Thanks, can't believe it's taken me 8 years to learn about that.


Would a blatantly false status page meet the criteria for false advertising?

My gut says yes.


Also, 3 years ago they had an outage in Strasbourg, and the status page was down apparently as a result of the outage.

https://news.ycombinator.com/item?id=15661218

They are not the only ones though. All too common. Well, it's tricky to set this up properly. The only proper way would be to use external infra for the status page.


It's not difficult to make a status page with minimal false negatives. Throw up a server on another host that shows red when it doesn't get a heartbeat. But then instead you end up with false positives. And people will use false positives against you to claim refunds against your SLA.

So nobody chooses to make an honest status page.
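
For the record, the heartbeat version really is only a few lines; host names and paths here are invented:

    # On production, from cron every minute: push a heartbeat to the status host.
    * * * * * ssh -o ConnectTimeout=5 beat@status.example.net touch /var/heartbeats/web

    # On the independent status host, also from cron: go red when the beat is stale.
    age=$(( $(date +%s) - $(stat -c %Y /var/heartbeats/web 2>/dev/null || echo 0) ))
    if [ "$age" -gt 300 ]; then status="Possible outage"; else status="Operational"; fi
    sed "s/__STATUS__/$status/" /etc/status/template.html > /var/www/html/index.html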


As someone who maintained a status page (poorly), I'm sorry on behalf of all status pages.

But, they're usually manual affairs because sometimes the system is broken even when the healthcheck looks ok, and sometimes writing the healthcheck is tricky, and always you want the status page disconnected from the rest of the system as much as possible.

It is a challenge to get 'update the status page' into the runbook. Especially for runbooks you don't review often (like the one for the building is on fire, probably).

Luckily my status page was not quite public; we could show a note when people were trying to write a customer service email in the app; if you forget to update that, you get more email, but nobody posts the system is down and the status page says everything is ok.


Yep. I guess what could be done is a two-tiered status page: automated health check which shows "possible outage, we're investigating" and then a manual update (although some would say it looks lame to say "nah, false positive" which is probably why this setup is rare).


Isn’t that why things like statuspage.io exist, though?


Where is statuspage.io hosted?


I work for a CDN, and we had to change our status page provider once when they became our customer.


In the cloud of course. Why do you ask?


You're completely right. IMO this is the best comment in this whole thread. Their status page must be broken, or it's a lie.


If there is a SLA with consequences associated with it every status page is going to be a lie.


Well, it sucks to catch fire and I care for the employees and the firemen, but if their status page is a lie then I have a whole lot less sympathy for the business. That's shady business and they should feel bad.

I can appreciate an honest mistake though, like the status page server cron is hosted in the same cluster that caught fire and hence it burnt down and can't update the page anymore.


Is the status page relevant though? At the very least, OVH immediately made a status announcement on their support page and they've been active on Twitter. I don't see anything shady here. From their support page:

> The whole site has been isolated, which impacts all our services on SBG1, SBG2, SBG3 and SBG4. If your production is in Strasbourg, we recommend to activate your Disaster Recovery Plan

What more could you want?


> Is the status page relevant though?

What's the point of a status page then if it does not show you the status? I don't want to be chasing down twitter handles and support pages during an outage.


Still better than Amazon where their status page describes little and fat chance of anyone official sharing anything on social media either.

I wonder if a server fire would cause Amazon to go to status red. So far anything and everything has fallen under yellow.


To have a status page that reflects actual statuses? To know that I'm not being lied to or taken advantage of? To know that my SLA is being honored?


lol... that’s how most status pages are


The weather map is interesting: http://weathermap.ovh.net/

No traffic whatsoever between sbg-g1 and sbg-g2 and their peers.


It seems to be a static site, which seems reasonable since it aggregates a lot of data and might encounter high load when something goes wrong, so generating it live without caching is not viable. So maybe the server that normally updates it is down too (not that this would be a good excuse)?


10+ hours cache on a status page doesn't look like real-time monitoring to me.

I think this is probably linked to a manual reporting system and they got bigger fish to fry at the moment than updating this status page.


Counterpoint: There's a constantly updating timestamp at the top of the page that suggests it's automated and real time.


> Legend: Servers down: 0 1+ 4+ 6+ 8+ 10+ 15+

If they don't have any servers anymore, how can they be down ;)




Noticed the same this morning, but note they took the link out of the main page (of SBG2 and 3): https://www.ovh.com/world/community/status/

Probably the best they could do at the moment.


Four hours later still green.


https://twitter.com/olesovhcom/status/1369504527544705025

"Update 5:20pm. Everybody is safe. Fire destroyed SBG2. A part of SBG1 is destroyed. Firefighters are protecting SBG3. no impact SBG4."


> Fire destroyed SBG2

This is crazy.

SBG2 was HUGE and if this isn't a translation error on the part of Octave (which I could understand given the stress and ESL) I have a hard time fathoming what kind of fire could destroy a whole facility with nearly 1000 racks of equipment spread out across separated halls.

I'm really hoping "destroyed" here means "we lost all power and network core and there's smoke/fire/physical damage to SOME of that"

I can't even fathom a worst-case scenario of a transformer explosion (which does occur and I've seen the aftermath of) having this big of an impact. Datacenters are built to contain and mitigate these kinds of issues. Fire breaks, dry-pipe sprinkler systems and fire-extinguishing gas systems are all designed to prevent a fire from becoming large-scale.

Really glad nobody was hurt. OVH is gonna have a bad time cleaning all this up.



I’d say destroyed is an adequate word, yes.


I'm really shocked, that's an incredible amount of damage. Lots of people lost data.


If a fire suppression system kicks in or the fire department shows up with their hoses, would they still say the fire destroyed it, or just say it was destroyed by fire and water damage?

Also, fire suppression systems do fail. There was an infamous incident in LA at one of the studios. They built a warehouse to be a tape vault, with tapes going back to the 80s. A fire started, but the suppression system failed because there was not enough pressure in the system. Total loss. Got to keep your safety equipment properly tested!


It wasn’t just “tapes going back to the 80s.” Those were just the media Universal initially admitted to losing. No, that building was the mother lode. It had film archives going back over 50 years, and worst of all — unreleased audio tape masters for thousands upon thousands of recording artists. The amount of “remastered” album potential that fire destroyed is probably in the billions of dollars, let alone the historical loss of all those recordings by historical persons that will never be heard due to a failure of a fire prevention system. Fascinating case study in why you should never put all your eggs in one basket.

https://en.m.wikipedia.org/wiki/2008_Universal_Studios_fire


In 1996 Surinam lost many of their government archives after a fire burned the building down.

I can find surprisingly few English-language resources for this (only Dutch ones); guess it's a combination of a small country + before the internet really took off.


This is why I'm such a fan of digitizing. If you have 1 film master you effectively have none. Do some 8K or 16K scans of that master, manage the resulting data well, and you're effectively 100% immune from future loss in perpetuity.

Losing parts of history like that is tragic.


There's a problem with testing sprinklers: engaging them can be damaging to contents and even structures. So, we're talking about completely emptying the facility, then taking it offline to dry for a time. I've never heard about this being done to anything that was already operational (but I wasn't researching this either).


There are methods of testing the water pressure in the pipes without actually engaging the sprinkler heads. It is part of the normal checks done during the maintenance/inspection a business is supposed to have done. In fact, one place I was in had sensors, and would sound the actual fire alarm if the pressure fell below tolerance at any time. The lack of pressure in the Universal vault was 100% inexcusable.


It’s common for sprinklers in parking garages in cold climates to be “dry” where there’s a bubble of air that needs to run through first before water shoots out from a non-freezeable source.

Want to test the system? Just turn off that water valve, hook up your air compressor and pressure test to your heart’s content.


Isn't something like Halon used in data centers for that reason? That can probably be tested without damaging infrastructure.


Halon has been pretty much banned for years now; other agents have been introduced. Sadly, making an actual full test of a gaseous extinguishing system (such as FM200 or Novec 1230) can be prohibitively expensive (mainly the cost of "reloading" the system with new gas). They are mostly just tested for correct pressure in the tanks and whether the detection electronics are working; making an actual dump would be very impractical (evacuation of personnel, ventilating the room afterwards, etc.)


Good point. But, apart from cost of the gas, mentioned by sibling comment, that also isn't disruption-free. Gas-based systems are by definition (they displace oxygen) dangerous to humans. But yeah, not being able to run maintenance is orders of magnitude less problematic.


OVH design their own datacenters, so it's possible that they missed something or some system or another didn't work as intended, thus the heavy damage.


They did not have a fire suppression system, only smoke detection. So yeah, they missed something.


There are photos of the fire suppression system in one of their data centres in this (French) forum thread: https://lafibre.info/ovh-datacenter/ovh-et-la-protection-inc.... They have sprinklers, with the reasoning being that the burning racks' data is gone anyway if there's a fire, and at least sprinklers don't accidentally kill the technicians.


> They have sprinklers, with the reasoning being that the burning racks' data is gone anyway if there's a fire

I think the real problem, per that post, is this:

>> They are simple sprinklers that spray with water. It has nothing to do with very high pressure misting systems, where water evaporates instantly, and which save servers. Here, it's watering, and all the servers are dead. It's designed like that. Astonishing, isn't it?

>> Obviously, they rely above all on humans to extinguish a possible fire, unlike all conventional data centers.

(all thanks to Google Translate)

This strikes me as a terrible safety system because even if a human managed to detect the fire, they have to make a big call: is the risk of flooding the facility and destroying a ton of gear worth putting out a fire? By the time the human decides "yes, it is", it may well be too late for the sprinklers.

> and at least sprinklers don't accidentally kill the technicians.

Not a real risk with modern 1:1 argon:nitrogen systems – the goal is to pump in inert gases and reduce oxygen content to around 13%, a point where the fire is suppressed and people can survive. You wouldn't want to be in a room breathing 13% oxygen for a long time, but it won't kill you.

All in all, it looks like this was a "normal accident"[1] for a hosting company that aggressively competes on price. The data center was therefore built with less expensive safeties and carried a higher risk of catastrophic failure.

[1]: https://en.wikipedia.org/wiki/Normal_Accidents


Sprinklers are only present in OVH's Canadian datacenter! There are no sprinklers in the European ones.


Given that they're headquartered in Europe and most popular there, why is the satellite location better? Is it because the Canadian data center is newer, because Canada has stronger regulations in this area, or something else? Also, does anyone know how the OVH US data centers compare?


According to a 2013 forum post by the now CEO of Scaleway, a competitor of OVH, it's due to North American building regulations that basically force you into sprinklers and stuff for insurance reasons.

Source in French: https://lafibre.info/ovh-datacenter/ovh-et-la-protection-inc...


Where did you get this?



They say different here: "The rooms are also equipped with state of the art fire detection and extinction systems."

https://www.ovh.com/world/solutions/centres-de-donnees.xml

and here

"Every data center room is fitted with a fire detection and extinction system, as well as fire doors. OVHcloud complies with the APSAD R4 rule for the installation of mobile and portable fire extinguishers and has the N4 certificate of conformity for all our data centers." https://us.ovhcloud.com/about/company/security


That's for its North America DCs, not European ones.


Which is understandable, as Halon based fire suppression systems have been illegal for quite some time :/


There are non-Halon systems, using agents such as FM-200. Those are not as toxic and do not destroy the ozone layer.


> They did not have a fire suppression system

I find it very hard to believe that that would pass code anywhere in the US/EU or most of the world. They may not have had sprinklers but that doesn't mean there isn't fire suppression.


According to a forum post discussing the building of said DC, with photos and such, there's no visible fire suppression system:

https://lafibre.info/ovh-datacenter/ovh-et-la-protection-inc...


The floors are plywood and it's built with wood frame construction.

I retract all my prior surprise that this place burned down.


The pictures seem pretty clear that a lot of it is gone, judging by the blackened holes in the walls with firehoses being sprayed into.


Data centers frequently burn down, or are destroyed by natural disaster.

These days, fire suppression systems need to be non-lethal, so inert gases are out. Water is too, for obvious reasons. Last I checked, they flooded the inside of the DC with airborne powder that coats everything (especially the inside of anything with a running fan). Once that deploys, the machines in the data center are a write-off even if the fire was minor.


Most datacenters I’ve worked with in the last 5 years use water.


Just guessing, but maybe a fire suppression system going off could wipe out all the machines?

The couple datacenters I've been inside were small, old and used halon gas which wasn't supposed to destroy the machines. No idea how it works in big places these days.


A few years back there was an incident in Sweden where noise coming from the gas based fire suppression system going off destroyed hard drives [1].

1. https://www.theregister.com/2018/04/26/decibels_destroy_disk...


I've also seen a weird video (lost to time unfortunately), where someone showed that they could yell at their servers in a data centre and introduce errors (or something similar). Was very strange to see, but they had a console up clearly showing that it was having an impact.



I actually planned to include a link to this video in my original response but then left it out. Thanks for posting!


Because of course it was Bryan :D


Brian just giving his opinion on anything would give hard drives errors... That guy loves to run his mouth (and I love to listen).


Actually, although the video is on Bryan Cantrill's YouTube account, that was Brendan Gregg.


I was merely the videographer! For anyone who is curious, I discussed the backstory of that video with Ben Sigelman.[0]

[0] https://www.youtube.com/watch?v=_IYzD_NR0W4#t=28m46s



We had the same issue at a customer site. To add: since the decibel level was outside the rated environment, the warranty on the hard disks was void and they had to be replaced even if they were not destroyed.


This was effectively posted on the outages list as well by someone trustworthy. The pictures also look pretty bad from the outside.


Which outages list? Sounds interesting.



Ah, thanks!


I wonder if they had batteries inside the containers. That can make a bad situation much worse.


For some context:

“SBG1, the first Strasbourg data center, consisting of twelve containers, came online in 2012. The 12 containers had a capacity of 12,000 servers.

SBG2 is a non-container data center in 2016 using its “Tower” design with a capacity of 30,000 servers.

SBG3 tower was built in 2017 with a capacity of 30,000 servers.

SBG4 was built in 2013 as several containers to augment capacity, but was decommissioned in 2018 and moved to SB3”

https://baxtel.com/data-center/ovh-strasbourg-campus


and a 2017 (!) article where they were planning to remove 1 and 2 because of power issues!

> OVH to Disassemble Container Data Centers after Epic Outage in Europe

> “This is probably the worst-case scenario that could have happened to us.”

> OVH [...] is planning to shut down and disassemble two of the three data centers on its campus in Strasbourg, France, following a power outage that brought down the entire campus Friday, causing prolonged disruption to customer applications that lasted throughout the day and well into the evening.

https://www.datacenterknowledge.com/uptime/ovh-disassemble-c...


It sounds like the plan was to shut down 1 and 4; the latter happened and the former did not.


It’s confusing as they say they’re shutting down 2 of the 3, but then there’s 4...


An easy mistake (in the quoted tweet) to make given the circumstances, but it probably should be 5:20am rather than pm.


> In the case of Roubaix 4, the Datacenter is made with a lot of wood:

> Finally, we have other photos of the floor of the OVH "Roubaix 4" tower. It is clearly wood! Hope it's fireproof wood! A wooden datacenter ... is still original, we must admit.

> In France, data centers are mainly regulated by the labor code, by ICPE recommendations (with authorization or declaration) and by insurers. At the purely regulatory level, the only things that are required are:

> - Mechanical or natural smoke extraction for blind premises or those covering more than 300m2

> - The fire compartmentalization beyond a certain volume / m2

> - Emergency exits accessible with a certain width

> - Ventilation giving a minimum of fresh air per occupant

> - Firefighter access from the facade, for premises where the floor of the top level is more than 8 meters up

> - 1 toilet for 10 people (occupying a position considered "fixed")

https://lafibre.info/ovh-datacenter/ovh-et-la-protection-inc...


Ah, interesting pics and discussions (in French) in the neighboring thread: https://lafibre.info/datacenter/incendie-sur-un-site-ovh-a-s...


"1 chiotte" -> "1 toilet" not puppy


A puppy would be called a "chiot", if someone wonders.


In case these were not posted before, here are pictures of SBG2 in flames taken by the firefighters: https://twitter.com/xgarreau/status/1369559995491172354

This puts an image to the statement "SBG2 is destroyed". Do not expect any recovery from SBG2.


Holy hell. Are these "datacenters" really just shipping containers? That's what it looks like.



Insert docker joke...


Aren't they a bargain-basement provider? You get what you pay for I guess?


Status shows green across the board

http://status.ovh.com/vms/index_sbg2.html

Am I looking at the wrong thing or am I right to wonder why we still bother with public status pages if it never shows the real status?

Edit: nvm just saw another comment pointing out the same further down the thread (I randomly came across this page while looking for the physical location of another DC)


Puzzles and puzzle storm is down on Lichess:

> Due to a fire at one of our data centres, a few of our servers are down and may be down permanently. We are restoring these servers from backups and will enable puzzles and storm as soon as possible.
>
> We hope that everyone who is dealing with the fire is safe, including the firefighters and everyone at OVH. <3


I opened two tabs to relax tonight: Hacker News and Lichess. This was the top HN thread, and Lichess is having issues because of the fire.

I didn't know what OVH was before 10 minutes ago, but this seems really impactful. I hope everyone there is safe and that the immediate disaster gets resolved quickly.


Look them up, they're one of the biggest hosting providers in the world (especially in France), and due to cheap prices they're especially popular for smaller-scale stuff.


Yeah, they scale down to their kimsufi line which used to have quite powerful dedicated servers for the price of basic VPSes from other providers.

e.g. They have a 4core, 16gb ram server for $22/mo which is 25% of what my preferred provider, Linode, charges.

Now, it comes with older consumer hardware (that one is a Sandy Bridge i5), about as much support as the price tag suggests, and a dated management interface. But when I was a college student running a modded Minecraft server, which needed excessive amounts of RAM and could be easily upset by load spikes from other clients, it was a no-brainer, even if I'd expect the modern-ish Xeons Linode uses to win on a core-for-core basis.


Dated? They're probably the only place without comedy "bare metal cloud" pricing that not only has an API for their $5/mo servers, but whose panel is a reasonably modern SPA built on that API, with OAuth for login.


Has it been replaced since I last used it in ~2016? This is not the interface I had to use at all.

This is the interface they had when I used them last:

https://www.youtube.com/watch?v=h5-J_DO_FS0


Yeah this is not at all what you use now. It's an Angular SPA.

It is still a mess of separate accounts, but you can use an email address to log in instead of the randomly generated numeric handle.

OVHcloud US is completely separate for legal purposes, from what I remember. No account sharing, different staff, and OVH EU cannot help you at all with US accounts.

https://www.youtube.com/watch?v=I2G6TkKg0gQ


Yes it changed. It was replaced by a more modern version a few years ago (but the transition was painful, as not everything was implemented in the new version when they started to deploy it).


For me, at times I wouldn't bother trying to remember my NIC handle and would just curl the API instead, out of laziness.


I think playing "from position" is also broken? I was playing chess with my friend and we usually play "from position" but it wasn't working just now so we're playing standard instead. It might be an unrelated bug.


This is certainly a good reminder to have regular backups. I have (had?) a VPS in SBG1, the shipping-container-based data centre in the Strasbourg site, and the latest I know is that out of the 12 containers, 8 are OK but 4 are destroyed [1]. Regardless, I imagine it will be weeks / probably months before even the OK servers can be brought back online.

Naturally, I didn't do regular backups. Most of the data I have from a 2019 backup, but there's a metadata database that I don't have a copy of, and will need to reconstruct from scratch. Thankfully for my case reconstructing will be possible - I know that's not the case for everyone.

Right now I'm feeling pretty stupid, but I only have myself to blame. For their part, OVH have been really good at keeping everyone updated (particularly the Founder and CEO, Octave Klaba).

I believe that when I signed up back in 2017, the Strasbourg location was advertised as one of the cheapest, so I can imagine a lot of people with a ~$4 / month OVH VPS are in the same situation, desperately scrambling to find a backup.

(For those that have a OVH VPS that's down right now, you can find what location it is in by logging onto the OVH control panel.)

[1] https://twitter.com/olesovhcom/status/1369598998441558016


This is a literal nightmare for me.

I can remember several San Diego fires that threatened the original JohnCompanies datacenter[1] circa mid 2000s and thinking about all of the assets and invested time and care that went into every rack in the facility.

Very interested to read the post-mortem here ... even more interested in any actionable takeaways from what is a very rare event ...

[1] Castle Access, as it was known, at the intersection of Aero Dr. and the 15 ... was later bought by Redit, then Kio ...


Man, 2006 was a freaky time to be in San Diego.


Here are some pics of what it looked like before, SBG 1-4, and its history.

https://baxtel.com/data-center/ovh-strasbourg-campus


Is it just my laptop or is only ~40% of vertical screen real-estate dedicated to the actual content? :<


I have the same issue on my XPS 13 (4K screen), the header takes up a good 30% or so of the height of the screen and it's like reading through a mailbox slit.


Nope, exactly the same here (on landscape phone).


It's just you


I'm surprised to see how close the DCs are to the river. Fortunately it's on higher ground along the river, less prone to flooding.


Maybe quite handy for water cooling?


SBG3 is almost adjacent to SBG2. It is impressive that the firefighters saved SBG3 with minimal damage.


There is no cloud, there is just other people’s computers. And they’re on fire.


Your data was in the cloud, now it's in the smoke.


...which has now become an actual cloud.


You might even say, it's gone with the wind...

Yeeeeeahhhhhhhhhh!!!!!


Ah, that explains why there are so many dark patterns in the cloud! /s


This is a literal kubectl delete pod


kubectl delete dc


The key differentiator of OVH is the very compact datacenters they achieve thanks to water cooling. Some OVH execs were touting that in a recent podcast.

Interestingly, in this case, having a very compact data center was probably an aggravating factor. It shows how complex these technical choices are: you have to weigh operating savings against the severity of black swan events...


Interesting. That said, the technique is not the issue here, losing a whole datacenter can always happen. This event would have been much less serious if all the four SBG* datacenters were not all so close to each other on the same plot of land.

They are so close to each other that they are basically the same physical datacenter with 4 logical partitions.


They are all annexes built up over time, and a victim of their own success (the site was meant to be more of an edge node in size). The container-based annexes were meant to be dismantled 3 years ago but profit probably got in the way.


Rust (the video game) lost all EU server data w/o restore https://twitter.com/playrust/status/1369611688539009025


Now that is weird. I fully understand not having a backup service in place, but no data backup either?


TBF "data" here means the state of the gameworld which anyway resets entirely every 2 weeks or every month, so it's not exactly a big deal, everyone constantly start over in Rust.


Ah, I see. That makes sense then. I thought it's an everlasting game world, thanks for clarification.


Someone forgot the 3-2-1 rule for backups.

I don't get why people don't do offsite backups today, as they're basically free. AWS Glacier costs almost nothing.



I can't see anything about a fire suppression system mentioned? Doesn't OVH have one, except for colocation datacenters?

A fire detection system using e.g. lasers, with Inergen (or Argonite) for putting the fire out, is commonly used in datacenters. The gas fills the room and reduces the amount of oxygen in it, so most fires are put out within a minute.

The cool thing is that the gas is designed to be used in rooms with people, so it can be triggered at any time. It is however quite loud, and some setups have been known to be too loud, even destroying hard drives.


At a datacentre I used to visit years ago, part of the site induction was learning that if the fire suppression alarm went off, you had a certain amount of time to get out of the room before the argon would deploy, so you should always have a path to the nearest exit in mind. The implication was that it wasn't safe to be in the room once it deployed, but I don't know for sure.


Inergen is used to lower oxygen to ~12.5 %, from normal 21 %. Most fires need more than 16 % oxygen.

Inergen consists of 52 % nitrogen, 40 % argon and 8 % carbon dioxide.

Carbon dioxide might sound strange, but it makes you breathe faster and deeper to compensate for the lowered amount of oxygen.

The whole point of Inergen is to quickly put out fires where water/foam/powder isn't usable and the room might contain people.

It's cool stuff, and I was under the impression that basically all datacenters used it.


Presumably you would have trouble breathing as it displaced the oxygen in the room?


Yeah, that was always my understanding. GP saying "the gas is designed to be used in rooms with people, so that [it] can be triggered any time" made me second-guess that though.

Maybe there is a concentration of oxygen that is high enough for humans to survive, yet too low to sustain combustion?


The "gas" is more like a fog of capsules. FM-200 is a common one. Basically it has a Fire suppression agent inside crystals which are blasted into the room by compressed air. These crystals melt when they get over a certain temperature and therefore won't kill you; however, breathing that in isn't really pleasant.

Source: I've been in an FM-200 discharge


Does everyone carry an N95 or equivalent? I guess many of us have gotten good at donning them by now...


> Maybe there is a concentration of oxygen that is high enough for humans to survive, yet too low to sustain combustion?

Yes, supposedly (12% ish?). Can't say I'd be thrilled at the idea of testing it.


I heard that the pressure difference might rupture your ear drums.


A few years ago the company I work for installed suppressors on the Inergen system. It did trigger from time to time, which was traced to the humidifiers. And yes -- it did destroy hard drives because of the pressure/sound waves before we installed the suppressors. We haven't had any incidents after we fixed the humidifiers.

But Inergen (and other gases) is more or less useless if you allow it to escape too quickly. So the cooling system should be a fairly closed circuit.

Edit: I'm also a Norwegian dude. :)


So how is it safe with people if there is no oxygen left to breathe? Reminds me of my first trip to a datacenter, where the guy who accompanied us said: "In the event of a fire this room is filled with nitrogen in 20 seconds. But don't worry: nitrogen is not toxic!" Well, I was a little worried :)


Newer systems like the ones mentioned above are designed to reduce the amount of oxygen in the room to around 12% (down from around 21%). That's low enough to extinguish fires, but allows people to safely evacuate and prevents them from suffocating if they're incapacitated.


> even destroying harddrives

So loud that it destroys hard drives... That's scary, are people's eardrums much more resistant?


A lot of DCs are loud enough that it’s quite pleasant to wear ear protection. Even better with headphones that play what you want to listen to.

Sometimes doubled up: big earmuffs with in-ear canal headphones and you might even pull off a phone call.


Are fires common in data centres? Specialized fire suppression tech seems to indicate that they are.


They aren't super common, but halon and other gas systems are just the right tool for the job. They can get inside the server chassis and don't damage equipment like a chemical application would. We won't know what went wrong at OVH until a proper post mortem comes out. These systems work by suppressing the flame reaction, but if the actual source was not addressed, it could reignite after a while.


Yes, mostly due to two reasons:

* overall high energy density (lots of current flowing everywhere)

* the batteries for backup power are dangerous, and can easily(ish) overheat when activated.


The decision to design and implement specialised tech comes from a combination of how likely the risk is and the magnitude of the potential loss. Fires are not that common in DCs, but the potential loss can be enormous (as OVH is currently finding out).


What's the point of a fire suppression system that destroys what it should protect?


When it destroys one room but prevents fire spreading to whole building.


In a data center the fire suppression is mostly there to protect the servers.


Seems like data.gouv.fr [1], the government platform for open data, is impacted; we might not get the nice COVID-19 graphs from non-governmental sites ([2], [3]) today.

I can't wait for the conspiracy theories about how the fire is a "cover up" to "hide" bad COVID-19 numbers...

[1] https://twitter.com/GuillaumeRozier/status/13695724905996902...

[2] https://covidtracker.fr/

[3] https://www.meteo-covid.com/trouillocarte (Just wanted to share the "trouillocarte" - which roughly translates to "'how badly is shit hitting the fan today' map" ;) )


it has nothing to do with covid:

- https://www.usine-digitale.fr/article/le-health-data-hub-heb...

- https://www.genethique.org/la-cnam-refuse-le-transfert-globa...

- https://france3-regions.francetvinfo.fr/bretagne/donnees-med...

Here is a more plausible conspiracy theory: the USA/Microsoft is behind the cyberattacks and the fire.


Video here. This will be Disaster Recovery 101 material.

https://mobile.twitter.com/abonin_DNA/status/136953802824345...


Or disaster prevention. I'm curious how their fire suppression system failed so spectacularly..


I think they failed to install it.

They only mention smoke detection: https://www.ovh.com/world/us/about-us/datacenters.xml


In French but with pictures: https://lafibre.info/ovh-datacenter/ovh-et-la-protection-inc... - probably a different OVH data centre, but they clearly have sprinklers there. The argument they made against gas extinguishers is that by the time they have to use it, the data is gone anyway, and it's only going to trigger in the affected areas. It's also far safer for the people working there.


That's a very interesting find! Google Translate seemed to do a good enough job on it for me. It's from 8 years ago, and there are people in that thread ripping on what they see in the photos.


Ouch, that's not pretty... but it seems that the fire was constrained to 1 or 2 sectors (inside the building) - per their updates

Not sure how good the building's fire suppression systems were.


Looks like a lot of those containers/compartments are melted.


Almost 11 years ago (March 27, 2010), the Ukrainian datacenter of the company Hosting.ua went up in flames as clients watched their systems go unresponsive row by row across the datacenter.

The fire suppression system didn't kick in. The reason? For a couple of days the system had been detecting a little smoke from one of the devices. Operators weren't able to pinpoint the exact location, considered it a false alarm, and manually switched the fire suppression system off.


Sounds disturbingly similar to the story of how Chernobyl went into meltdown. Also in Ukraine!


> If your production is in Strasbourg, we recommend to activate your Disaster Recovery Plan.

Ouf.


“ Update 7:20am Fire is over. Firefighters continue to cool the buildings with the water. We don’t have the access to the site. That is why SBG1, SBG3, SBG4 won’t be restarted today.”

https://mobile.twitter.com/olesovhcom/status/136953578757072...


With water, they said.


Liquid cooling.



I sent this story to my colleagues and one of them asked "where is the FM200?"

I don't really know how FM200 systems work in data centres, but I'm guessing that if the fire didn't start from within the actual server room, FM200 might not save you? e.g. if a fire started elsewhere and went out of control, it would be able to burn through the walls/ceiling/floor of the server room, in which case no amount of FM200 gas can save you, right?

Another possibility, of course, is that the FM200 system simply failed to trigger even though the fire started from within the server room.

There are no published investigation details about this incident yet, I believe. Can somebody chime in about past incidents where FM200 failed to save the day?


I think most of these gases are or will eventually be banned in Europe because of their impact on the environment. I've seen newer datacenters use water mist sprays.


Nitrogen makes up 78% of the atmosphere, so I doubt it will be banned. Most datacenters don't actually use halocarbons despite the common "FM200" name.


You might be thinking of halons, which are ozone-depleting halocarbons? They are mostly phased out worldwide but existing installations might still be in use.

FM200 is something else that is often used in modern builds (not just datacenters).


It seems that HFC are being phased out too: https://ec.europa.eu/clima/policies/f-gas/legislation_en


I've heard that one. I thought it mostly affects refrigerants, but I didn't notice that FM200 is also an HFC. There are other fire suppression gasses with a low global warming potential, which probably can still be used in the future.


How... what. What if the fire is electrical? You can't just go "well the triple interlocked electrical isolation will trip and cut the current" if a random fully-charged UPS decides to get angry...



Ah, so it's possible that they also used a water sprinkler system at SBG2. But still, I wonder how the fire protection system (water sprinkler, FM200, or otherwise) failed to save SBG2?

It doesn't really surprise me that the machines are dead, but the whole place being destroyed is much more surreal.



"Everyone is safe. Fire has destroyed SBG2. A part of SBG1 is destroyed. The firefighters are protecting SBG3. No impact on SBG4". Tweet from Octave Klaba, founder of OVHcloud. "All our clients on this site are possibly impacted"

Tout le monde est sain et sauf. Le feu a détruit SBG2. Une partie de SBG1 est détruite. Les pompiers protègent actuellement SBG3. Pas d’impact sur SBG4 », a tweeté Octave Klaba, le fondateur d’OVHcloud, en désignant les différentes parties du site. « Tous nos clients sur ce centre de données sont susceptibles d’être impactés », a précisé l’entreprise sur Twitter.


Wow. That equipment is going to be very hard to replace right now too.


Reminder to not only have backups, but also have some periodic OFFLINE backups.

If your primary is set up with credentials to automatically transfer a copy to the backup destination over the network, what happens if your primary gets pwned and the access is used to encrypt or delete the backup?

Secondly, test doing restores of your backups, and have methods/procedures in place for exactly what a restore looks like.


> what happens if your primary gets pwned and the access is used to encrypt or delete the backup?

Append-only permissions. We do this in S3 for that specific reason. S3 lifecycle rules take care of pruning old backups.

You can also build a pull-based system where auth resides on the backup system, not the production system.


Wow. Weren't they just angling for an IPO? I wonder how much of this was insured and what the impact is to their overall operations.


I think they announced their IPO plans yesterday [1], which is probably the worst timing one could have (if there even is a good time for a datacenter to burn down; probably there isn't).

If they have good insurance, I'm confident this will have little impact on their operations; I really hope they do. I host a few components on OVH/SoYouStart dedicated servers, luckily not mission critical, but I've had a rather good experience with them, especially in terms of price to performance.

[1] https://www.reuters.com/article/amp/idUSKBN2B01FM


The publicity damage alone will be on par with (if not bigger than) their replacement costs. I wouldn't be surprised if they had to rebrand.

The timing is so bad that it becomes almost suspicious. When is the best time to sabotage your competitor? When they are the most visible.


> The publicity damage alone will be on par with (if not bigger) their replacement costs. I wouldn’t be surprised if they had to rebrand.

Honest question: which publicity damage?

A fire in a datacenter is very much part of the things you should expect to see happen when you operate a large number of datacenters and will obviously cause some disruption to your customers hosting physical servers there.

Provided the disruption doesn't significantly extend to their cloud customers and doesn't affect people paying for guaranteed availability (which it shouldn't - OVH operates datacenters throughout the world), this seems to me to be an unfortunate incident but not a business threatening one.


Most people, I feel, would expect fire suppression to kick in and prevent the whole data center (and the adjacent ones) from catching fire. The fact that it didn't is concerning for their operations, since they build their own custom data centers. The fire isn't the issue; how much damage it did is the issue. So one can ask whether there was a systematic set of planning mistakes of which this is just the first to surface.


SBG is a small part of their operations. I doubt it will have a lasting issue on their image


Dang, I have a lot of respect for Octave and what he has created. https://twitter.com/olesovhcom?s=21


After the total loss of one data-center I would tend to disagree with this statement.


Wow, SBG3 seems to be OK: «Update 11:20am: All servers in SBG3 are okey. They are off, but not impacted. We create a plan how to restart them and connect to the network. no ETA. Now, we will verify SBG1.»

https://twitter.com/olesovhcom/status/1369592437585412097


Lichess was affected by this fire: https://twitter.com/lichess/status/1369543554255757314

But they seem to be back up


Like something out of Mr Robot


In my experience, backups in companies are barely done.

Companies want quick money, so they push people to skip important IT operations, like disaster recovery plans.

And backups are the least monitored systems.


Interesting! I got the news from my local package courier's website. They warn that their services may be unreliable due to the fire at OVH.

It's all connected.


Last update

https://twitter.com/olesovhcom/status/1369535787570724864?s=...

"Update 7:20am Fire is over. Firefighters continue to cool the buildings with the water. We don’t have the access to the site. That is why SBG1, SBG3, SBG4 won’t be restarted today."


Most modern data centres wouldn't have had this issue. At least in Australia they use Argonite suppression systems, which work by using a gas that is a mixture of argon and nitrogen and suppresses fire by depleting oxygen in the data hall.


I'm seeing quite a lot of repeated sentiment throughout the comments that Halon is illegal and is no longer used.

Is the situation "Halon is legal in Australia" or "Halon isn't actually illegal per se if you don't use a lot of it"?


Halons are being phased out because they are ozone-depleting halocarbons.

Other gases like argon or FM200 are not halons.


Halon and Argonite are unrelated.


I don't think Aussie DCs are all the same. Global Switch in Sydney uses Inergen, but Equinix uses water, at least at SY1; I'm reasonably sure GS is much older, though.


Hugops to OVH folks, hang in there, you're my favorite European data center operator.


Came across this on r/France

https://i.imgur.com/epj1Lue.png

Translation :

We lost our Gitlab and backups...

And the automatic backups that had been put in place no longer worked so a priori we lost everything...


He posted an update, seems SBG2 is totally destroyed.

Ouch https://twitter.com/Onepamopa/status/1369484420982407173


> DID ANYTHING SURVIVE? ANYTHING AT ALL?????? OUR DATA IS IN SBG2. WHAT DO WE DO NOW ?!?!

Double check those backups, folks.


One of the things I'm still extremely grateful for is that I learnt the basics of computer science from an ex-oracle guy turned secondary school teacher, who wasn't the best programmer let's say but who absolutely drilled into us (I was probably the only one listening but still) the importance of code quality, backups, information security etc.

Nothing fancy, but it's the kind of bread and butter intuition you need to avoid walking straight off a cliff.

He also let me sit at the back writing a compiler instead of learning VB.Net, top dude


Trust but verify. As a developer it doesn't matter what sysadmins or anyone else says about backups of your data; if you haven't run your DR plan and verified the results, it doesn't exist.


> Trust but verify.

If it's business critical should you even trust at all?


yeah backups are a case of "don't trust, verify"


what do we do? Well for starters don't rely on a single geographic location chuckles


> We recommend to activate your Disaster Recovery Plan.

What percentage of organizations have these?


Some plans have a single step: 1) panic.


Short answer: not enough.


I guess we're about to find out (or, rather, they are).


There have been a handful of talks at computer security conferences talking about setting up physical traps in server chassis (such as this one: https://www.youtube.com/watch?v=XrzIjxO8MOs). Since seeing those I've been waiting for some idiot to try something like that in a physical server and burn down a data center.

There is NO evidence that is what happened here, and I don't think OVH allows customers to bring their own equipment, making it even less likely. Still, I hope to hear a root cause for this one.


> At 00:47 on Wednesday, March 10, 2021, a fire broke out in a room in one of our 4 datacenters in Strasbourg, SBG2. Please note that the site is not classified as a Seveso site.

> Firefighters immediately intervened to protect our teams and prevent the spread of the fire. At 2:54 am they isolated the site and closed off its perimeter.

> By 4:09 am, the fire had destroyed SBG2 and continued to present risks to the nearby datacenters until the fire brigade brought the fire under control.

> From 5:30 am, the site has been unavailable to our teams for obvious security reasons, under the direction of the prefecture. The fire is now contained.


This is horrible and a sobering reminder to do the things we don't enjoy or consider -- disaster recovery.

How many of us here plan Fire Drills within our teams and larger organizations?


I don't know whether my 10+ year side project with 225,000 users (www.webminal.org) is gone forever! :(

I have backup snapshots, but they're stored in OVH itself :( hoping for a miracle!


If your backup snapshots are stored through OVH's normal backup functionality, then create a new server at e.g. RBX now, and restore from those backups. That'll take a few hours and it'll all be up again quickly.


Really sorry to hear that; I hope you get it restored. I can't judge - I use the default backup options in Azure and hope they store it in another data centre, but I never thought to check too hard. This is very bad luck.

Hopefully you had the code in GitHub, but that still leaves the DB. It looks like yours has something to do with command line or Linux lessons, so I'm not sure how much user data is critical? Maybe you can get this up and running again to some extent.


What was the reason of storing the backup on the same server? To allow for rollbacks in case of data corruption or some changes gone wrong?


I think they’re talking about the backup services provided by OVH, in which I believe they’re stored in RBX.


Yes, I thought rollbacks would be much easier in case of data loss. :sob:


I have a VPS in SBG1.

So far I haven't received any communication from OVH alerting me about this. I think that's the first thing they should do: alert their customers that something is happening.

Anyway, I was running a service that I was about to close, so this may be it. I do have a recovery plan, but I don't know if it is worth it at this point.

I'm never using OVH again. The fire can happen, but don't ask me about my recovery plan - what about yours?


There's a big link from their homepage: https://www.ovh.ie/news/press/cpl1786.fire-our-strasbourg-si...

It says they've alerted customers, but I expect some have been missed, through inaccurate email records, email hosted on the systems that have been destroyed etc.


Was the building made from stacked shipping containers? Containers are such a budget-friendly and trendy structural building block these days. They even click with the software engineers - "Hey, it's like Docker".

Containers would seem to be at a disadvantage when it comes to dissipating, rather than containing, heat. I hope improved thermal management and fire suppression designs can be implemented.


OVH use a custom water cooling solution they claim enables these niche designs and increases rack density.

As another comment says, that density probably exacerbated the issue here.


I could be wrong, but I understood from other comments that only SBG1 used containers.


This event reminded me of the fire at The Planet a while back: https://www.datacenterknowledge.com/archives/2008/06/01/expl...


Someone told me that there were a lot of warez sites hosted with OVH.

Anyone know of any casualties?


My VPS in SBG3 stopped pinging around 9am.

My impression is that they tried very hard to maintain uptime, which was probably a bad idea given the extent of the damage. This VPS just hosts external-facing services and is easy to set back up.


Been there, done that.

The 2nd worst thing, IF you happen to catch it soon and control it, is that the temperature rise triggers many alerts and automatic controls.

So when it's controlled, it's still a real nightmare. Here, firemen could not even control it...


For the curious - this is what it looks like when a fire suppression system activates in a (small) server room.

https://youtu.be/DrDU4UQUwKg?t=60


I see several people talking about advanced backup systems for businesses. I do not have a company; I work as a freelancer. I am Brazilian, and the 70 euros I was paying was already compromising my income, since my local currency is quite devalued against the dollar. So imagine the situation: my websites are down, and the only backup I have is a copy of the VPS that I made in November last year, because I was intending to set up a server at my house, since it was getting expensive to maintain this server at OVH. It would be unacceptable for a company of this size not to have its servers backed up, or to keep backups in the same location as the incident, since their networks are all connected. I hope they have a satisfactory and quick solution to this problem.


If OVH backed up everything, the cost of the service would be double.

Many customers don't need a backup, so it's up to each customer to arrange their own backups — perhaps with tools and services provided by the hosting company, or their own solution.

Running a company with no backup (for cost or any other reason) is very risky, as some people will have found out today.


I had my server in SBG2, and sadly the backups had been failing since the end of January. Yep, it is my mistake for not checking the backups. Now I've lost about 1 month of data.

The only good thing is that my backup was offsite.

Does OVH offer automatic snapshots for VPSes? I know Hetzner does, for an extra 20% of the server's cost. If they do, the next question would be whether those are destroyed too.


They do offer automatic backups, and they're offsite (RBX in this case).

Here's more information on them: https://docs.ovh.com/gb/en/vps/using-automated-backups-on-a-...

With OVH, depending on the situation, the price can double your cost: for example a 3€ VPS plus a 3€ backup configuration (the backup price depends on size).


Well, not having managed backups is obviously part of choosing to go bare-metal. They do have triple-redundancy backups in their cloud offerings. Nobody to blame but yourself.

Also, if you’re hosting clients’ static websites, you were burning your money, there are way cheaper options out there (and fully managed).


Hi,

Does anyone recommend a mail provider that implements IMAP failover and replication across different sites?

My mail account is hosted by OVH, it has been down for hours and from what I read I may have to wait for another 1 or 2 days.

Thanks!


Interesting. I wonder if the cladding was a major problem here? It looks like it has all burnt out and could have had the fire spread extremely rapidly on the outside.


The cladding was metal so very unlikely it contributed to the fire spreading.


ACM was the reason the Grenfell disaster happened in the UK.


“ACM” is a plastic sheet with aluminum skins. It would be plain idiotic to cover your DC in that (and residences too, but I digress).

You can see from pictures of the OVH fire aftermath that the outer plates have melted holes in them characteristic of metal.


Did OVH claim that SBG1 and SBG2 were isolated failure domains? Despite them just being different rooms in the same building?


They are different physical buildings. OVH does not generally claim anything regarding AZ or distance. SBG1, 2, 3, etc is just denoting the building your server is in - they are not like AWS style AZ or similar, quite literally just building addresses.

I have used them for years and I don't believe they've ever said anything like deploy in both SBG1 and SBG2 for safety or availability, because you don't get that choice.

When you provision a machine (eg via API) they tell you "SBG 5 min, LON 5 min, BHS 72h" and you pick SBG and get assigned first-available. There is no "I want to be in SBG4" generally.


In fact, OVH themselves host e.g. backups of SBG in RBX.


They're separate buildings, with separate power systems, just standing next to one another. Next to SBG2 are also SBG3 and SBG4.


Those buildings weren't quite as far apart as they should have been if a fire in one requires all of the others to be turned off...


Not disagreeing with you but the firefighters probably shut down all power onsite as soon as they arrived.


Apparently they are more like (temporary) extensions of the main building than separate DCs.




Late to the party, but ... context? Everyone is talking like OVH or the SBG2 buildings are well-known and common knowledge.



Always, always keep local backups folks.


“Local” meaning on the same server, just in another folder, right?


Sorry, I should've phrased it better.

Local as in your home/office. While your application may run in AWS or on whatever remote server, it's necessary to have copies of your data that you can physically touch and access.

One main deployment, one remote backup and one onsite physically accessible backup.


The best time to test your backups is before the production server dies in a fire.


The photos look like the building had external cladding, wonder if that contributed to the size of the blaze [1].

[1] https://en.wikipedia.org/wiki/Grenfell_Tower_fire


To be honest, this gives me a little schadenfreude. OVH is the most notorious host that refuses to act on abuse complaints for phishing sites.


Finally a legitimate reason for BSD sysadmins to run poweroff -n!


I'm never gonna financially recover from this...


It’s a Thermal Event not a fire


holy shit my 3$ VPS!!

nvm not this dc!


my VPS is gone, hmm


Don't they build datacenters with FM200 anymore? Even the antique Stanford datacenter I did some work in had bottles of it all over.


(site a)---[replicate local LUNs/shares to remote storage arrays]--->(site b)
(site a)---[replicates local VMs to remote HCI]--->(site b)
(site a)---[local backups to local data archive]--->(site a)
(site a)---[local data archive replicates to remote data archive]--->(site b)
(site b)---[remote data archive replicates to remote air gapped data archive]--->(site b)
(site a)---[replicates to cold storage on aws/gcp/azure]--->(site c)
(site c)---[replicate to another geo site on cloud]--->(site d)

Scenario 1: site a is down. Plan: recover to site b by the most convenient means.

Scenario 2: site b is down. Plan: restore services, operate without redundancy out of site a.

Scenario 3: site c is down. Plan: restore services, catch up later; continue operating out of site a.

Scenario 4: sites b and c down. Plan: restore services, operate without redundancy out of site a.

Scenario 5: sites a and b down. Plan: cross fingers, restore to a new site from cold storage on expensive cloud VM instances.

Scenario 6: data archive corrupted by ransomware. Plan: restore from the air-gapped data archive, hope the ransomware was identified within 90 days.

Scenario 7: sites b and c down, then site a down. Plan: quit.

Scenario 8: staff hate the job and all quit. Plan: outsource.

Scenario 9: and so on...


The cloud is on fire.


elliot..is it you?


not to be a conspiracist, but are they still hosting wikileaks data? https://en.wikipedia.org/wiki/OVH#WikiLeaks


Looks a little bit like Fukushima. I hope the clean up doesn't take that long though..


Is that a New Serverless Platform everyone is talking about recently?


I just recently started moving some services for my business to one of OVH's US-based data centers. Should I take this fire as evidence that OVH is incompetent and get out? I really don't want AWS, or the big three hyperscalers in general, to be the only option.


IMO you should take this fire as evidence that you need to have (working!) backups wherever you host your data. AWS, GCP, and Azure are not fire-resistant, same as OVH. I don't know if OVH is more or less competent than the big three; I choose to trust no one.


I read multiple times that they didn't even have sprinklers, only smoke detectors in their EU datacenter(s). I'm 100% sure AWS, Azure and Google have better fire prevention.


This thread has people saying they have sprinklers, don't have sprinklers, have / don't have gas suppression, and have puppies / actually have toilets.

Wait for the misinformation hose to dry up, and decide in a few weeks.

https://us.ovhcloud.com/about/company/security


And this is why the big 3 will continue to dominate. AWS, Microsoft and Google can throw a lot more money at their physical infrastructure than any other cloud provider.

After this sorry episode, I don't think any CTO or CIO of any public company will be able to even consider using the other guys.

edit: I am not implying that we put all eggs in one basket with no failover and DR. I am implying the big cos will pay a 2x premium on infrastructure to project reliability.


I could replicate my whole infrastructure on 3 different OVH datacenters, with enough provision to support twice the peak load - it would still be cheaper than a single infrastructure at AWS, and I would get a better uptime than AWS: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...


I agree with you 100%. However, I did state *any CTO or CIO of any public company*... the executives don't worry about costs, they worry about being able to *project reliability*.


Executives that worry about reliability would insist on deploying on multiple data centers, which would make the project more reliable than any single AWS availability zone.

Also, cost matters if the AWS bill is one of the company's top expenses.


reserved a1.large instances are about half the price of OVH's b2-7 instance. a1.xlarge are still cheaper (and larger). So you get more raw compute per dollar on AWS.

What?


If you need machines that large, and are willing to use reserved instances, you'd go with dedis on OVH instead of VMs, which is significantly cheaper.


OVH dedicated instances start at about the size of an a1.metal instance, which is ~30% more than the comparable OVH instance, but you can get discounts in various ways.

Or you could use t4g.2xlarge, which is cheaper. There's no situation where OVH is 3x cheaper (I mean maybe if bandwidth is your thing, but IDK).


I'm curious how you make your comparisons. Here is mine:

AWS a1.metal: $293 per month for 16 ARM cores, 16GB memory

OVH Rise-4 [1]: $131 per month for 16 Intel Xeon cores, 128GB memory, 500Mbit/s

And the ARM CPUs on the AWS instances are nowhere near the intel cores here, so maybe this configuration is comparable:

OVH Rise-APAC [2]: $57 per month for 4 Intel Xeon cores, 32GB memory

[1] https://www.ovhcloud.com/fr/bare-metal/rise/rise-4/

[2] https://www.ovhcloud.com/fr/bare-metal/rise/rise-apac/


Performance of those AWS instances comes nowhere close despite the specs.


If all of your infrastructure is in one data center, you're on a disaster clock no matter who you choose.


With Google arbitrarily killing accounts, and with Amazon showing that they’ll do the same if it’s politically expedient, I’m not sure I’d trust the big three, either. It’s a case of “pick your poison”.


I don't know what OVH is, and going to the site points me to a speed test with no information.


Imagine AWS with fewer features but 10x-100x lower prices. And now you know why, until a few years ago, they were larger in traffic, customers, and number of servers than even AWS.


They’re a web/hosting service provider, like AWS or GCP, but with less services. They’re much more popular in European countries.


You went to ovh.net instead of ovh.com.

Largest hosting provider in Europe, probably top10 or top5 in the world.


French AWS/GCP is my understanding.


They are really more like Hetzner. They have "cloud", but most of the business is dedicated servers. They also operate kimsufi.com and soyoustart.com.

They do have APAC, Canadian, and US data centers as well.


Honestly just for the memes. https://isovhonfire.com


Probably not a good look for this to be the first thing another hosting company thinks to put up in response.


OVH aren't nice in that regard, and have trolled competitors with "leave it to the pros" before when there were serious incidents (of which OVH have had their fair share), so it's not surprising.


/shrug/ We have our infrastructure partially in OVH, I see it as a friendly jab at them and a way to get updates without having to navigate to twitter.


Did this already exist, or was it thrown up insanely fast?


whois[0] shows it was registered on March 10, so thrown up insanely fast.

[0] https://who.is/whois/isovhonfire.com


Threw it up really quick.



I'm not sure if it's because my tolerance of Graham Linehan has snapped or not, but I barely laugh at the IT Crowd any more. As with other GL shows, I find it's mostly held together by the cast's delivery and such.

The laugh track and the writing are honestly dated, even by the standards of Dad's Army.


I don't remember the details, but I think that season 2 kind of retroactively ruined season 1. They used to have all those O'Reilly and EFF stickers, and working at a help desk at the time, it felt very authentic. Then everything got super nice in season 2 -- leather couches, people were dressing nicely, etc. It kind of lost its charm. You can't rewatch it because you know Denholm is just going to randomly jump out of a window.

(Having said that, I think "Fire" was a memorable episode that is still amusing. The 0118911881999119 song, "it's off, so I'll turn it on... AND JUST WALK AWAY".)

It might have been ahead of its time. Silicon Valley was well received and is as nerdy and intricately detailed as Season 1 of the IT Crowd. "Normal people" thought it was far out and zany. People that work in tech have been to all those meetings. And, a major character was named PG!


> They used to have all those O'Reilly and EFF stickers, and working at a help desk at the time, it felt very authentic. Then everything got super nice in season 2 -- leather couches, people were dressing nicely, etc. It kind of lost its charm.

That sounds like a pretty realistic allegory of the last two decades in Free Software (or software in general, or the web...)


I only just realized the Paul Graham/Peter Gabriel easter egg...


The IT Crowd's comedy became dated incredibly quickly, just like Father Ted's.

Comedies that came later ditched the laugh track. They had to work harder to get viewers at home to laugh, but ultimately a bunch of them (starting with The UK Office) hold up much better as a result.


Unfortunately a lot of people are going to find out the hard way today why AWS/GCP/Big Expensive Cloud is so expensive (hint: they have redundancy and failover procedures, which drive up costs).

Keep in mind I'm talking not of "downtime" but of actual data loss, which might affect business continuity.

This is really tragic. I'm hoping they have some kind of multi-regional backup/replication and not just multiple zones (although from the tweets it appears that only one of the zones was destroyed; however, the others don't seem to be operational at the moment).


“Big cloud” has had fires take out clusters, and somehow they manage to keep it out of the news. In spite of the redundancy and failover procedures, keeping your data centers running when one of the clusters was recently *on fire* is something that is often only possible due to heroic efforts.

When I say “heroic efforts”, that’s in contrast to “ordinary error recovery and failover”, which is the way you’d want to handle a DC fire, because DC fires happen often enough.

The thing is, while these big companies have a much larger base of expertise to draw on and simply more staff time to throw at problems, there are factors which incentivize these employees to *increase risk* rather than reduce it.

These big companies put pressure on all their engineers to figure out ways to drive down costs. So, while a big cloud provider won’t make a rookie mistake—they won’t forget to run disaster recovery drills, they won’t forget to make backups and run test restores—they *will* do a bunch of calculations to figure out how close to disaster they can run in order to save money. The real disaster will then reveal some false, hidden assumption in their error recovery models.

Or in other words, the big companies solve all the easy problems and then create new, hard problems.


I'm curious what references or leads I might follow to learn more about these fires and other events you mention.


Get a job working at these companies and go out for drinks with the old-timers.


You know, those are excellent observations. But they don’t change the decision calculus in this case. Using bigger cloud providers doesn’t eliminate all risk, it just creates a different kind of risk.

What we call “progress” in humanity is just putting our best efforts into reducing or eliminating the problems we know how to solve without realizing the problems they may create further down the line. The only way to know for sure is to try it, see how it goes, and then re-evaluate later.

California had issues with many forest fires. They put out all fires. Turns out, that solution creates a bigger problem down the line with humongous uncontrollable fires which would not have happened if the smaller fires had not been put out so frequently. Oops.


I encourage you to have a look at the operating income that AWS rakes in.

Sure, the amount of expertise, redundancy and breadth of service offerings they provide is worth a markup, but they are also significantly more expensive than they need to be.

Thanks to being the leader in an oligopoly, and due to patterns like making network egress unjustifiably expensive to keep you (/your data) from leaving.


I think the question here, then, is one of subjective value.

AWS may charge more for egress, but that’s not high enough for it to be a concern for most clients.

A bigger, independent concern is probably that there should be sufficient redundancy, backups and such to allow for business continuity. (Note again that I'm not saying all companies make full use of these features, but those that care about such things do. Additionally, I've honestly never heard of an AWS DC burning down. Either it doesn't happen frequently or it doesn't have enough of an effect on regular customers; both situations are equivalent for my case.)

Most businesses choose to prioritize the second aspect. Even if they have to pay extra for egress sometimes, it’s just not big enough of a concern as compared to businesses continuity.


An availability zone (AZ) in AWS eu-west-2 was flooded by a fire protection system going off within the last year. It absolutely did affect workloads in that AZ. That shouldn't have had a large impact on their customers since AWS promote and make as trivial as is viable multi-AZ architectures.

Put another way: one is guided towards making good operational choices rather than being left to discover them yourself. This is a value proposition of public clouds, since it commoditises that specialist knowledge.


What surprised me most about today's fire is that their datacenters have so little physical separation. I expected them to be far enough apart to act as separate availability zones.


Hm, I can't find anything in google about this flooding incident. Can you share some details / source?


I've never heard of any data centre burning down (and I work in this industry), so never hearing of an AWS DC burning down isn't really saying anything about AWS.


I remember the hosting.ua DC mysteriously "catching fire".


AWS and GCP are also prone to same kind of data loss if the AZ you are operating in goes down.

They don't automatically geo-replicate things. You still need a backup for the torched EC2 instance to be able to relaunch in another AZ/region.


That's true, but it seems the whole SBG region for OVH is within the same disaster radius for one fire... with SBG2 destroyed and SBG1 partly damaged.

"The whole site has been isolated, which impacts all our services on SBG1, SBG2, SBG3 and SBG4. "

I wonder if those SBGx were advertised as being the same as "Availability Zones", when other cloud providers ensure zones are far enough from each other (~1 km at least) to likely survive events such as fire.


They were not, and were never advertised as, anything similar to an AZ. You could not choose to deploy in SBG1, 2, 3, etc. You only pick city = Strasbourg at deploy time. It's merely a building marker.


> but it seems whole of SBG region for OVH is within same disaster radius for one fire

SBG is for Strasbourg. That's not a region. It's a city. Obviously, SBG1 to 4 are in the radius of one fire. It's four different buildings on the same site.


The buildings are VERY close to one another.

https://cdn.baxtel.com/data-center/ovh-strasbourg-campus/pho...


That seems, uh, problematic.


Why? They aren't making any sort of guarantees with those locations; they aren't advertised as separate fault zones. Those buildings are meant as expansions of the existing site.


That's a fair point. If OVH does market them as AZs, then it's disingenuous and liable to suits, IMO.


No, it isn't, as there's no clear cut definition of what an availability zone is.


If the data is that critical, surely you would be backing it up frequently and also mirror it on at least one geographically separate server?

I use a single server at OVH, and I'm not in the affected DC, but if this DID happen to me I could get back up and running fairly quickly. All our data is mirrored on S3 and off site backups are made frequently enough it wouldn't be an issue.

Plus, you still need to plan for a scenario like this even with AWS or any other cloud provider. It is less likely to happen with those, given the redundancy, but there is still a chance you lose it all without a backup plan.


Yup, I've never heard of a fire taking out a Big Cloud DC. They actually know what they're doing and don't put server racks in shipping containers stacked on top of each other. If you want quality in life, sometimes you have to pay for it.

Personally I'll continue to use these third world cloud providers. But I like to live on the edge.



> Apple fire

Their solar panels on the roof caught on fire. Presumably there was no actual disruption in service computing-wise.

> Google fire

2006, still a young company at that point. Also this was way before GCP where they were responsible for other people's services.

> AWS fire

Data center caught on fire while it was under construction. Drunk construction workers messed up, not AWS. Presumably no computers were even inside yet. This is completely different from OVH completely burning down while all their customers' data went up in flames.

If these are the best examples of Big Cloud messing up, this is quite the endorsement.

It's funny how the OVH CEO tells everyone to go activate their DRPs while the company didn't have enough foresight to install fire suppression systems.


These are just the first examples I could find. Fires account for 20%+ of all DC outages in the industry. But taking the time to find a reason to invalidate each one, it seems like you have your mind set.


White collar jobs are not paid well in the EU. If a saleswoman from a nearby shop has the same salary, why should I care about the quality of my work at all? The entire data center burned down? Fine! There are a lot of other places with the same tiny salary.

As you can see below, a developer expert is only making 24K-63K Euro per year at OVH (in US dollar it's almost the same amount):

https://www.glassdoor.com/Salary/OVHcloud-France-Salaries-EI...

after paying taxes you will only get a half of that amount.



