What's most epic about this is that it's in the UNIX Hater's Handbook. One of its rants was about how better-designed systems would warn you if you were about to nuke your whole system, the reasoning being that a command to wipe the whole system was more likely a mistake than a developer's or admin's intent. UNIX would do it without blinking. Inherently unsafe programming and scripting combined with tools like that meant lots of UNIX boxes went kaput.
And today, over two decades later, a person just accidentally destroyed his entire company with one line without warnings on a UNIX. History repeats when its lessons aren't learned. This problem, like setuid, should've been eliminated by design fairly quickly after it was discovered.
EDIT: Added a link to ESR's review of the UNIX Hater's Handbook, which links to UHH itself: http://esr.ibiblio.org/?p=538 It nicely covers what was a rant, what was fixed, and what remains true. Linking in case people want to work on the latter, plus it explains my sour relationship with UNIX. :)
One of the more fascinating aspects of human history is how much effort we're willing to devote to creating varied senses of safety, even if that safety is only an illusion.
Now, we can call this a failure of design, but really, people who rely on technology they don't understand can't be saved by good design. Sure, this particular case could be fixed by disallowing the recursive flag on the filesystem root, but safety can never be the primary design concern of any technological system.
Imagine if a sword were made with safety as a first-class concern. You can't design a sword that can be used safely by the untrained. No weapon can be; training with a weapon is a prerequisite for safely using it. Similarly, every technology has to be understood by those using it. If you don't understand it, you're just inviting trouble.
For a business using technology, the needs are actually fairly straightforward. You need an understanding of what needs to be backed up, and a process for performing the backups. If you've picked the former right (backing up human-readable information rather than data only readable by software programs that might go away in a crash), then risk is minimized.
This is a strange post. You imply that, just because it's impossible to absolutely prevent every kind of disaster, no effort toward safety should be made at all.
By the same logic, you could strip away all the airbags, seatbelts, comfy seats and assistance systems of modern cars: After all, accidents still happen and safety mechanisms might even lure drivers into more reckless behavior. (This is in fact happening with seatbelts)
I think it's less useful to think about absolute safety than about which failures are likely to occur and how effective our measures against those specific failures are.
Requiring the --force flag is obviously not an effective safeguard against root deletions, otherwise we wouldn't have so many stories about it. My theory is that there are three reasons for it:
- As other people wrote, if you frequently batch-delete files, you get trained very quickly to always use -f as plain rm is very annoying to use for large sets of files. Unlike other flags, -f won't make you stop and think.
This could be fixed by making rm-without-f actually usable - for example by only asking once and not for every file, like, oh I don't know... Windows.
- rm can interact with shell parsing in very opaque and fatal ways. My guess is that most root deletions happen the way they did in this post: not a literal rm -rf / but some unfortunate variable interpolation where the author didn't realize that it can evaluate to "/" (see the sketch at the end of this comment). That's a very unobvious point of failure that takes a lot longer to learn than just using rm. Therefore rm should absolutely warn about it.
- there is actually an expectation that rm could be safe, as most deletes you do on a modern system are reversible - either because you have a "recycle bin" or a backup. So a warning would make sense to counter that expectation.
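To illustrate the interpolation trap harmlessly (a sketch - the variable names are made up, and echo stands in for rm so nothing gets deleted):

    # imagine both variables silently failed to get set earlier in the script
    BACKUP_ROOT=""
    CLIENT=""
    echo rm -rf "$BACKUP_ROOT/$CLIENT"
    # prints: rm -rf /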
'vinceguidry actually makes a pretty good point. It's one thing to cover potential stupid mistakes with safety features. But beyond some point, safety starts to oppose utility - i.e. a perfectly safe car would be a simple chair. Perfectly safe software is likewise software that is totally useless for anything.
It's important to consider when designing software that safety should be about gracefully handling mistakes, and not something that lures the user into a false sense of not having to know what they're doing. Unfortunately, the latter attitude is what drives today's UX patterns and software design in general, which is a big part of why tech-illiterate people remain tech-illiterate, and modern programs and devices are mostly shiny toys, not actual tools.
It's true that safety and also security can impair the usefulness of something past a certain point. It's also irrelevant to our current topic given the existence of systems that don't self-nuke easily. This is a UNIX-specific problem that they've fought to keep for over 20 years with admittedly some improvement. There were alternatives, both UNIX setups and non-UNIX OS's, that protected critical files or kept backups from being deleted [at all] without very specific action from an administrator. And nobody complained that they couldn't get work done on or maintain a VMS box.
So, this isn't some theoretical, abstract, extreme thing some are making it out to be. It's a situation where there are a number of ways to handle a few routine tasks with inherent risk. Some OS's chose safer methods, with unsafe methods available where absolutely necessary. UNIX decided on unsafe all around. Many UNIX boxes were lost as a result whereas the alternatives rarely were. It wasn't a necessity: merely an avoidable design decision.
I'm glad we have the same opinion then - as I said, it's not very useful to reason about "perfect safety".
It's certainly possible to make a product "safer" than necessary and hinder utility (though I think "safety" is the wrong concept to look at here - see below), but if the common opinion of your product among tech-illiterate people is "complicated and scary", I think you can be pretty sure that you are still a long way away from that point.
In fact, some versions of rm do add additional protection against root deletions, e.g. refusing to act on "/" unless you pass the --no-preserve-root flag. What utility did that protection destroy?
I believe if you really want to make people more tech-literate (which today's apps are doing a horrible job of, I agree), you have to give them an honest and consistent view of their system, yes.
But you also have to design the system such that they can learn and experiment as safely as possible and can quickly deduce what a certain action would do before they do it.
Cryptic commands, which are only understandable after extensive study of documentation, and which oh by the way become deadly in very specific circumstances don't help at all here.
"Cryptic commands, which are only understandable after extensive study of documentation, and which oh by the way become deadly in very specific circumstances don't help at all here."
Exactly. That's another problem that was repeatedly mentioned in the UNIX Hater's Handbook. It still exists. Fortunately, there are distros improving on aspects of organization, configuration, command shells, and so on. I'm particularly impressed with NixOS doing simple things that should've been done a long time ago.
> You imply that, just because it's impossible to absolutely prevent every kind of disaster, no effort toward safety should be made at all.
Not at all. We should absolutely work to make things safer. But we need to be realistic and temper our sense of idealism. Nothing was going to save this guy from disaster; if it wasn't 'rm', it just would have been something else.
My point is that you can't expect safety features to obviate the need to know what you're doing.
I don't know exactly how you'd plan on fighting reality. The fact that this guy was going to get hosed eventually isn't some justification; it's a fact. People who do stupid things get burned.
If someone wants safety features, let them pay for them. If someone wants to add one, sure, so long as I can remove it if it gets in my way. Who knows, maybe they'll actually be worth having. But I'm not going to lose sleep over every idiot who ruins his life over something he didn't or couldn't learn about. There's absolutely nothing you can do to save stupid people from making stupid decisions.
Maybe I remove a safety feature I don't need and hurt myself with it. Now I'm the moron. Hopefully I learn from it. Nothing you could have done about that either.
Show me something foolproof, and I'll show you a greater fool.
In the UNIX Hater's Handbook, defenders of rm consider accidental deletion a "rite of passage" and remark that "any decent systems administrator should be doing regular backups" (see page 62). The author's response is funny:
“A rite of passage”? In no other industry could a manufacturer take such a cavalier attitude toward a faulty product. “But your honor, the exploding gas tank was just a rite of passage.” “Ladies and gentlemen of the jury, we will prove that the damage caused by the failure of the safety catch on our chainsaw was just a rite of passage for its users.” “May it please the court, we will show that getting bilked of their life savings by Mr. Keating was just a rite of passage for those retirees.” Right.
I'm surprised how relevant parts of this book are 22 years later.
Have an immutable filesystem, where "deletes" are recoverable by going back in time. At least until you do a scheduled "actual delete" that will reclaim disk space.
Practically speaking, if you're quick, an 'rm' isn't totally destructive even without backups. There's a good chance your data is still there on the disk; it's just not associated with anything, so it could be overwritten at any point. Best to mount the disk read-only and crawl through the raw bits to find your lost data (I recovered a week's worth of code this way several years ago).
My favorite answer to the common interview question: "What was your biggest mistake, and how did you recover from it?" Answer: Back in 1993, I once deleted a critical data file. Fortunately, the AIX host was sitting next to me, so I quickly reached over and flipped the power switch off. The strategy being: writes were buffered and flushed out periodically, so hitting the power switch prevented that last write from hitting the disk. And if this didn't work (and caused more file system corruption), well I would have needed to restore from backup anyway.
Right, but if you delete your entire file system there won't be anything left to come along and do the "actual delete", so you're safe until someone comes along with a rescue disk or otherwise mounts it on a system that knows how to deal with this.
At the very least, when you rm important-file.txt instead of importanr-file.txt, you have a chance.
A pre-delete would immediately hide files from apps and services so they can "fail fast", and the actual delete would only happen after an "I've been running fine for two days" period. Of course this implies that actively open files should not be pre-deleted on UNIX at all (at least not by the rm process). Even if you deleted the entire filesystem, backups included, there would be a chance to boot into recovery mode and undelete everything. We could even go further and apply small-file versioning at the fs level to prevent misconfiguration accidents in /etc.
That's simple and powerful; I can't tell why it still isn't implemented today.
Well, there are some safeguard checks you can do to prevent the easy abuses, like requiring an additional flag for rm -rf-ing everything in the root or /home/*/, or across device boundaries as someone said.
You can redesign rm so people don't find themselves typing -rf as force of habit.
You can have multiple delete commands: mark as overwritable if needed and hide from 'ls -a'; actually delete; overwrite sectors with zeros - like we have in GUIs.
You can have a permission system that isn't just "this account can do literally anything" or "this account can't do anything"
Here's a dumb idea off the top of my head: /etc/rmblacklist.conf, autopopulated (by the distro) with a list of files (/boot and the actual boot image, for instance) that require a GNU-style long option --nuke to delete. It's still easily scriptable, but you'd rarely ever actually need it, and requiring a long option would serve as a double check that you meant it when you did.
Sure it's still possible to --nuke something due to a bug or negligence but I bet it'd cut down on fatal errors. Plus the user could have their own ~/.rmblacklist.conf to guard against particularly persistent dyslexias.
You could even erase the contents entirely if you really think being able to delete root on a whim keeps your kung fu strong.
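A rough sketch of what such a wrapper could look like (everything here is hypothetical - the blacklist file, the wrapper itself - and the --nuke handling is omitted for brevity):

    #!/bin/sh
    # refuse to delete anything listed (one absolute path per line) in /etc/rmblacklist.conf
    for target in "$@"; do
        case "$target" in -*) continue ;; esac    # skip options
        if grep -qxF "$target" /etc/rmblacklist.conf 2>/dev/null; then
            echo "rm: '$target' is protected; use --nuke if you really mean it" >&2
            exit 1
        fi
    done
    exec /bin/rm "$@"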
Darn, I didn't notice it was that old. Makes the situation even worse for UNIX rm defenders. Like when Trusted Xenix eliminated setuid vulnerabilities mostly by clearing the setuid bit during a write, with the admin having to manually reset it. Simple shit. Mainstream response? "Just audit all your apps for setuid and be extra careful in..." (sighs)
Use something like that with scripting. It really is that easy to protect critical stuff with an otherwise easy-to-use rm. UNIX has always been resistant to doing stuff like this. However, its architecture makes it fairly easy for developers to do it themselves. That's to its credit.
That attitude was pervasive in both UNIX and C. That other systems avoided common problems of both with little work or penalty shows it's unnecessary today. Yet, the results of that attitude continue to do damage.
The author stretches his analogies substantially. It's nice to see this was happening 22 years ago just like it is nice to see UNIX from more than 22 years ago still being used today.
Yeah, I saw the same problem in the NSA backdoor debate where people liken it to brake failures and such. I pointed out repeatedly that computer breaches rarely maim or kill anyone. Usually don't even cost them their jobs or bankrupt their business. Such analogies are strawmen to try to boost their argument with an emotional response.
>You can't design a sword that can be used safely by the untrained. No weapon can be, training with a weapon is a prerequisite for safely using it.
I was a tank crewman in the US military, and there is a high chance of death or dismemberment from regular tank operation. Drivers have very limited visibility, tank turrets can pin people as they traverse, the breech of the main gun violently recoils into the crew compartment with limited guards, the rounds can be set off by static electricity (or just plain lit on fire), I saw one guy smash all his teeth out when riding inside and the tank hit a hole, and so on.
We had a saying: "tanks are designed to kill, they don't care who."
Then why does rm have the -f flag in the first place? Clearly someone thought oh-hay-guys maybe a safeguard wouldn't be out of place. They just designed a really awful one.
For the last 1000 years, swords have typically been constructed with a guard. It's a simple and useful safety feature, and it almost never prevents an intended use. Same for disallowing '/' as the target of rm.
I don't want to sound fussy, but sword guards aren't designed to protect the user from the sword, they're for protecting the user from other swords. That they make using a sword safer is a side-effect. I agree with the point you're making though.
After every disaster, you can come up with a process that would have stopped that specific disaster. That doesn't mean it's a good idea to implement that process everywhere.
It's too bad you're being downvoted. It was not too long ago that people blamed the Germanwings Flight 9525 crash on how the cockpit door was designed to protect against hijackers, post 9/11:
> Imagine if a sword were made with safety as a first-class concern.
This seems like a silly example. A weapon, meant to dismember and maim its owner's attackers, is one of those things that's impossible to make completely safe. Granted, I could think of plenty of ideas that might make it safer without compromising its effectiveness as a weapon, but it's simply not an apt example to use here.
A computer is a general purpose device. It can be used to help image cancer, launch a nuclear weapon or play games. Considering that it's meant to be used by everyone, without discrimination, it seems to make sense that you need to do the best you can to protect the user from themselves.
I worked in Apple Care support for about a year. The majority of your users are not going to know all of the consequences of their actions, even the ones doing system administration (because let's face it, almost every company in the world needs at least a little of that now and not all of them are going to hire someone who knows what they're doing).
You can't protect a user from everything. But when you can protect a user from doing something that would have screwed up their whole system, lost a project, etc? That's helpful. Correcting input is what computers are essentially there for.
Interesting points. I think systems like Burroughs counter that concept, in that a lot of safety can be baked into a system. Here's what they did in 1961:
Notice that's good UI design for the time, hardware elimination of the worst problems, interface checks on functions, limits on what apps can do to the system, and plenty of recovery. Systems like NonStop, KeyKOS, OpenVMS, XTS-400, JX, and so on added to these ideas. You can certainly bake a strong foundation of safety into a system while allowing plenty of flexibility.
So, for example, critical files should be write-protected except for use by specific software involved in updates or administrative action. Many of the above systems did that. Then, one can use a VMS-style versioned filesystem that leaves the originals in place in case a rollback is needed, so long as there's free space for that. Such a system handling backups and restores with modern-sized HDs wouldn't have nuked everything. It might even have left everything intact, if using a lean setup, but I can't say for this specific case.
"You can't design a sword that can be used safely by the untrained."
A sword is designed to do damage. A better example would be a saw: it's designed to be constructive but carries the risk of cutting your hand off. Even that can be designed to minimize risk to the user.
"If you've picked the former right, (backing up human-readable information rather than data only readable by software programs that might go away in a crash) then risk is minimized."
That's orthogonal. A machine-readable format just needs a program to read it. The risk is whether the data is actually there in whole or part. This leads to mechanisms like append-only storage or periodic, read-only backups that ensure it's there. Or these clustered, replicated filesystems on machines with RAID arrays that lots of HPC or cloud applications use. Also, multiple, geographical locations for the data.
People doing the above with proven protocols/tools rarely lose their data. Then there's this guy.
Table saws should never be used on flesh. rm(1) should always be used on files. How in FSM's noodly universe is the command supposed to intuit which files it should safely delete versus those it shouldn't?
> ...or administrative action.
You mean like, "sudo rm -rf {$undefined_value}/{$other_undefined_value}"? D'oh!
Two different people here have already figured out this wouldn't have happened on OpenVMS due to its versioned filesystem with rollback. People also claim it had saner commands for this stuff, but I can't recall whether its delete command was smarter.
He really should not have made the first element of the path a variable. Doing an "rm -rf /folder/{$undefined_value}/{$other_undefined_value}" would have made his day much better.
Also, never having all backup disk volumes mounted at the same time is good practice.
There's also the phenomenon that people have an inherent tolerance of risk, so the "safer" you make something, the more reckless people tend to be.
When traction control and antilock brakes became mainstream, one result was that some people started driving faster on snowy roads, up until their risk tolerance was the same as before.
If you understand that a typo can destroy your business, you'll be careful to not log in as "root" on a routine basis and double check everything you do and keep good backups. On the other hand if you expect the system to prevent you from doing anything really damaging, you might be more careless about your approach.
> You can't design a sword that can be used safely by the untrained. No weapon can be
This is a really weird argument, since weapons are used in combat, which is not 'safe' by definition.
But if you do want a weapon that the untrained can use without much chance of hurting themselves, look to a spear. It was the go-to weapon for untrained militias from the time history began up to gunpowder taking over - and even then, bayonets are stuck on rifles to turn them into spears.
A counter that applies to almost every comparison the opposition brings up. Further, unlike rm, swords don't have an easy solution to their problems that fits in a comment or two in this same thread. ;)
$ sudo rm -rf /
rm: it is dangerous to operate recursively on ‘/’
rm: use --no-preserve-root to override this failsafe
Relatively modern distro, but this has been in coreutils for a while (it was fixed in Ubuntu's coreutils in 2008, for instance)
$ egrep '^(NAME|VERSION)=' /etc/os-release
NAME="Red Hat Enterprise Linux Workstation"
VERSION="7.2 (Maipo)"
Yes, I just did this on the workstation I'm typing on. I'm somewhat curious whether he did that on an absolutely ancient distro (>8 years old), or didn't actually run rm -rf but rather the Ansible Python equivalent.
Someone brought that to my attention on Schneier's blog, suggesting the post was trolling. I'm holding off on going that far for now, as some details might be missing and this sort of thing has happened many times. I don't use Ansible. Does it already pass --no-preserve-root or whatever other modifiers are necessary on modern distros?
I feel like if you use `rm -rf` (particularly `--force`) with privileges, it shouldn't be the job of Unix to stop you.
Also tangentially, if you don't have a sensible backup in place that would protect you from (or at least mitigate) a complete wipe of a single machine (or even all primary ones), you are doing something wrong.
The problem with that is people are trained to use -f because it's so annoying to try and use rm without it.
Really, the -f flag should just mean "don't ask for confirmation" and a separate flag should be required to mean something like "yes I do want to nuke my computer". And maybe there should be a flag that means "cross device boundaries", and by default it could refuse to delete anything that has a device number different than the argument it started with. That would at least prevent you from nuking your network-attached storage.
This is probably a far better solution than trying to make rm smarter about which kinds of files it's somehow either "safe" or "unsafe" to unlink. Even then, though, you'd still get people who invoke it with "--across-devices --yes-i-really-mean-it" set, with unexpected and disastrous consequences.
And then someone will come along and bitch that rm isn't safe enough yet again.
Well, the point is that --across-devices won't be a flag you use as a matter of habit, it's something you'll only use when rm tells you it won't cross the device boundary and you realize that, yes, you really do want to delete across a device boundary. So you'll only add it in the specific cases where it's warranted, and you'll have already tried rm without it (to verify that you're deleting the right thing).
Come to think of it, I don't think I've ever used rm to delete across device boundaries. It just doesn't seem like an action you usually want to take.
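Incidentally, GNU rm already ships something close to this: --one-file-system skips any directory that sits on a different filesystem from the argument you named. Something like (path invented):

    rm -rf --one-file-system /srv/build    # won't descend into a backup volume mounted underneath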
... so then you change it to be more safe yet again.
I don't understand this attitude. Of course software isn't perfect; it's not even close, it's pretty awful. But the best thing about it: it's malleable. When things don't work, you change them to work better.
When things don't work, you change them to work better.
I'd submit that adding layer upon layer of complexity to prevent all the myriad stupid things people might do using a particular piece of software isn't axiomatically "better".
Maybe if lives depend on its correct function, it's worth it, but that kind of strict requirements gathering and execution is well-understood by the people who live in that world.
Making sure that J. Random DevOps Dude doesn't foot-gun himself when he's paid to know better isn't that.
But J. Random DevOps doesn't just foot-gun himself; he foot-guns the whole Op. You can fire him because it turns out he didn't know better, but that's not going to fix the problem he created, the problem now affecting the whole company.
Believe me, I know this. Entirely too well, in fact.
At my last job, the senior DevOps dude foot-gunned the entire company, by running a read-write disk benchmark (using fio(1)) against the block device (instead of a partition, which, while still stupid, would at least not have been actively destructive) on both my master and all of my slave PostgreSQL hosts. At the same time. And, of course, without telling anyone what he was doing, so the first inkling I had that there was a problem was about 20 minutes later, when I started getting a steadily increasing number of errors suggesting disk corruption.
How does one make such a tool drool-proof enough to prevent that kind of idiocy? Please, help me figure that out. And then give me a time machine, because that was a 16-hour day I'd really rather not have experienced.
And, no, the right move is generally not to fire the jackass who makes that kind of mistake. In my case, above, the company spent about three quarters of a million dollars (just in revenue, never mind how much time was burned in meetings about the incident, my efforts to fix the problem, as well as his and the rest of his team's efforts, and so on) teaching him never to do that again. You don't buy lessons that expensively and then let someone else benefit from them.
(That said, he did get fired several months later for telling the entire engineering lead team to fuck off, in so many words, for their having made a perfectly reasonable request, which was entirely within his responsibilities, and his skills, to satisfy.)
You should not use it the first time around; verify that you actually need it first. That second's pause may save your data :) If you're feeling rash, get up and walk around until it passes.
When I was in seventh grade, I took an Industrial Arts class ("woodshop"). The first few weeks of the course were spent going over safety of the machinery. In particular, I remember a heavy-handed message that Mr. Hopfer gave at the drill press:
This is a piece of industrial machinery. It is not a toy. If you put your hand on the stage and lower the bit, the machine will not jam up and make funny noises because it is too difficult. Instead, it will drill a hole through your hand. That is what makes it useful. If it didn't do that, you wouldn't be able to cut through wood.
There are actually ways to make it so it will in fact stop on contact with one's hand (this exists for saws, but you could imagine an analogous system for a drill press):
That will also apparently stop when in contact with wood that's recently been cut down or is damp, destroying the (expensive) brake and blade in the process. It even has a bypass mode for that reason.
Luckily for us, computers are slightly more sophisticated than drills and we can incorporate sane safety checks with relative ease. In this case people are just asking for an explicit flag in the rare situation when they do want to delete everything.
I really hate this type of attitude. Just because .0001% of users want to do something, doesn't mean the other 99.9999% need to suffer for it.
Are you against aircraft collision avoidance too? If your pilot wants to fly into another plane, then the guidance system shouldn't try to stop him, right?
`rm -r` does give warnings. You have to intentionally turn them off with the -f flag. It sounds like the -f has become too standard, a default flag, which undermines its entire purpose. You should only be using -f if you're absolutely 100% sure what you're doing, and that clearly wasn't the case here.
A question: I checked the manpage for -f (I don't really ever use it on purpose; rm seems to work fine for most of my file-deleting needs).
It says "ignore nonexistent files and arguments, never prompt".
Seems to me that the "never prompt" behaviour is an important requirement for a backup-script, though? Cause backup-scripts should work unattended, and definitely not pause and wait for input under any circumstances, right?
True, but backup scripts that make use of rm -rf should have been really thoroughly tested, and contain a check that they are really only about to delete the thing they're supposed to delete.
Also, perhaps they shouldn't be running as root. There's no reason why any script should have permission to write everywhere. It just needs write access on the backup device.
He destroyed his company with a series of poor decisions. Running "rm -rf /" was just the last one. And it was the last one because it successfully destroyed the company.
Yep, it was an improper backup and recovery system that did him in. I'm sure many other businesses have been destroyed by the same, whether or not it was an rm -rf that was the trigger.
Not sure I agree. My grandpa had a saying: "It's impossible to make things fool-proof because fools are so ingenious." No matter how many safety checks you add to a system, someone will find a way to fuck it up.
Poka-yoke (ポカヨケ) is a Japanese term that means "mistake-proofing". A poka-yoke is any mechanism in a lean manufacturing process that helps an equipment operator avoid (yokeru) mistakes (poka). Its purpose is to eliminate product defects by preventing, correcting, or drawing attention to human errors as they occur. The concept was formalised, and the term adopted, by Shigeo Shingo as part of the Toyota Production System. It was originally described as baka-yoke, but as this means "fool-proofing" (or "idiot-proofing") the name was changed to the milder poka-yoke.
This question that went unanswered in the replies bears repeating:
Any idea why the command actually ran? If $foo and $bar were both undefined, rm -rf / should have errored out with the --no-preserve-root message.
The only way I can think of that this would have actually worked on a CentOS 7 machine is if $bar evaluated to '*', so what was actually run was 'rm -rf /*' (which --preserve-root doesn't catch, since the glob expands to the individual top-level directories).
As the above notes, I'm pretty sure recent versions of Redhat/CentOS actually protect against this sort of thing.
On the offchance you're not running a recent server, however, this could also be avoided by using `set -u` in the bash script, as it would cause undefined variables to error out.
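A minimal illustration (variable names made up):

    #!/bin/bash
    set -u
    rm -rf "/backups/$foo/$bar"
    # with $foo unset, bash aborts with "foo: unbound variable" before rm ever runs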
Another lesson to be learned is that it's exceptionally bad practice to use Ansible to push out shell scripts that can be handled by native Ansible modules: http://docs.ansible.com/ansible/file_module.html
I feel like there is nobody replying who tested this theory because their computers don't work anymore.
It is a tragic story, but rm -rf has been almost a joke in the industry for a very long while now. Even really old systems should have received an update of some form, to such an extent that the story in the OP would be ridiculous rather than a discussion topic.
When I use the command I need to block out all distractions. I check my surroundings for things which might fall on my keyboard. I borderline make sure my phone is turned off before I carefully begin typing that.
I feel uncomfortable typing it into hacker news anywhere but the middle of a sentence. I can't imagine the bullets I would be sweating while deploying a bash script to all servers that included it. There is a problem that needs to be addressed.
Before you apply `set -u` to all of your scripts, be aware that an empty array counts as undefined. So if you have arrays for which being empty is valid, be sure to `set +u` right before accessing them (and then `set -u` again after).
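For example (bash; as far as I remember this particular case was relaxed in bash 4.4 and later):

    set -u
    files=()
    echo "${files[@]}"    # older bash: "files[@]: unbound variable"
    set +u
    echo "${files[@]}"    # fine: expands to nothing
    set -u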
This is why you should never run rm -r with the -f switch the first time. -f suppresses all the prompts, including the ones that would make you confirm you really mean what you're about to erase. I see this constantly in SO and other answers, and it's a really, really bad practice to do it without thinking.
This was inside a shellscript though. It could well be that the script e.g. needed to remove write-protected files, so the flag really had to be there.
I wish git was better about having a deletable directory. Most of the time that I need to use the -f switch, it's because I get this trying to remove a .git directory:
I'm a bit surprised the newspaper did not validate the source? They're basically quoting a Super User / Stack Exchange thread which was probably a troll.
If a hosting company had deleted 1535 client accounts, we would have heard other stories about it from angry clients?
It's kinda strange that major "mainstream" news publications are publishing articles with web board posts as the primary source, indeed, the entire story itself, with basically zero extra work. They could at least try to contact the guy, interview him, make sure he at least seems legit.
That's not too surprising, unfortunately. With the internet and the obsession with getting news out as quickly as possible (because hey, being the first to report something gives you a lot more backlinks and clicks), the standard of proof for a story has gone from 'a lot of evidence gathered through actual investigation' to 'someone said this on an internet forum somewhere'. Basically, the internet and social media rewards quick reporting, not accurate reporting or verification.
It's still better than a lot of gaming news sites though, where 'some guy mentioned something on Twitter/Reddit/4chan' is suddenly front page news within ten minutes.
There's a specific reason why some called it fake: as advised by others, Marsala ran a dd command to save the raw content of the disks, in case the recovery process screws up somehow.
He inverted the `if` and `of` arguments. You'd expect him to pay attention after what had just happened. This doesn't pass the smell test for some.
Then again, you'd also expect him to be quite stressed out. That does make that mistake a bit more likely.
Possibly, but the dd command does have more affordances than say, ld. `if` and `of` do stand for input file and output file, and it's harder to swap named arguments than positional arguments.
He could have swapped arguments like `/dev/sdc` and `/dev/sde` though…
The quotes in the article definitely do not match the diction you typically hear in an SE thread. SE users typically chime in with helpful, logic-driven advice—the advice they gave this unfortunate person sounded more like 4chan or reddit—"You're screwed!!"
This felt a bit unlikely, but what really convinced me it was a troll was a follow-up comment where he said he accidentally switched "if" and "of" in a dd command.
Does this article do anything other than paraphrase the ServerFault thread? On first reading I thought they had contacted the poor fellow to confirm his identity or something, but rereading, it would appear that they didn't: no further information than the original source.
(The user who asked that question now uses a nick, but had the real-sounding name mentioned in the article when I first read that ServerFault question earlier this week.)
edit: ...I really hope there isn't a real Marco Marsala someone pretended to be. Search engine results for that name are not great ATM.
While typing quickly I tab-completed to 'Makefile' and hit enter. Although it was a Makefile, it was executed as a bash script. bash ignored the incorrect syntax and executed line 10:
rm -rf $(DIRNAME)/*
If make had parsed the file, $(DIRNAME) would have been a make variable and nonempty. Under bash, $(DIRNAME) is command substitution, and since there is no DIRNAME command it expanded to nothing.
--no-preserve-root did not protect against this, because the target of the command was '/*'
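You can see what bash makes of that line harmlessly with echo (a sketch; the exact listing obviously depends on the system):

    $ echo rm -rf $(DIRNAME)/*
    bash: DIRNAME: command not found
    rm -rf /bin /boot /dev /etc /home ...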
How/why does this work without "#!" at the beginning of the file? I just tested with fish shell, and I get "Exec format error. The file './x' is marked as an executable but could not be run by the operating system." But, from bash or dash it does execute commands from the file.
That shouldn't happen: if you look at fs/exec.c (search_binary_handler), there isn't a "fallback to shell" option. And fs/binfmt_script.c doesn't fall back to shell either. Are you sure you don't have some weird binfmt_misc hook enabled?
Years ago, I worked in an investment bank and we had a programmer put a batch program into production that executed the following as a shell command:
rm -rf foo /
It was supposed to be:
rm -rf foo/
It didn't run as root, but still managed to wipe out all the business data files. What saved us was that the servers were configured with RAID 1 and before the start of the nightly batch cycle, the mirror was "split" and only one copy mounted.
So we just had to restore the missing files from the other half of the mirror to revert to the start of the batch window and rerun the entire night's jobs.
So in other words, the acceptance protocol for this script was inadequate.
In the minicomputer era, it was common for a programmer to be required to run it on this one poor donkey of a machine to make sure it caught nothing on fire before moving to the big machine.
It was the 1990s, not that it's an excuse but TDD wasn't a "thing" yet, nor was version control (at least not in that shop). For every change to a program we printed the diff, attached a cover memo, and filed it in a cabinet.
Yes, there were test systems and programmers were supposed to test all their changes, but they also were the ones who deployed their own changes to production, so there were ways for this to happen pretty easily.
If the programmer followed proper review and change control procedures, nothing should happen to them. Everyone writes bad code once in a while; in this particular case, the bug happened to be more catastrophic, but that's more bad luck than anything else.
We have code reviews and change controls not only to reduce the number of defects, but also to provide cover when mistakes inevitably slip through.
I don't believe that's the truth, for a second. Of course the Independent didn't look at the company in question to see if there was any litigation between this guy and his customers.
The real news here is that the Independent will write a feature story on a successful forum troll. Where were they back in the days of the Fucked Company message board when we could have used their help?
This is one of the reasons why we have infrastructure as code now, so system changes can be reviewed and tested just like application code, and more types of accidents can be reverted via source control :)
In the article the guy is using ansible. He even had off-site backups, but they were mounted before his ansible playbooks ran, wiping them out as well.
Another dangerous command is `crontab` with no arguments. It reads a new crontab from standard input. If you type Ctrl-C, it will abort and leave your existing crontab in place. If you type Ctrl-D, you've just created a new empty crontab and clobbered your old one.
My personal crontab is in a separate file in a source control system. I don't use `crontab -e`; I edit that file and feed it to the `crontab` command.
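Roughly this workflow (the file name is whatever you like):

    crontab -l > mycrontab    # snapshot whatever is currently installed
    $EDITOR mycrontab         # edit the copy kept under version control
    crontab mycrontab         # install it; never run bare 'crontab' with no argument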
(It would be nice if HN handled backticks the way they're done in Markdown.)
A customer did this on his server once while I sat with him to add something. Since restoring the crontab from backups would have been a little inconvenient for such a small thing, I grepped the log files for which commands were run by cron and at what interval, and had it rewritten in a few minutes.
I've managed that one as well. One month into a new job and working on servers with complex applications installed that I wasn't yet familiar with. Thankfully I had `crontab -l` just beforehand, otherwise I'd have been screwed.
Indeed. I do generally keep backups of crontabs, not just in case of this kind of scenario but also in case the platform blows up in any unexpected ways. Sadly the company in question didn't. However I have since made it a personal policy to always -l before editing so I have a "backup" in my tmux scroll back (that time I mentioned before, it was pure chance that I had -l)
On some platforms it's -r, on others it's -d. I suspect it comes down to which cron daemon you run, but I never really cared enough to investigate. In any case, both are next to the 'e' key, so either is just as dangerous in terms of typos.
A competent specialist will be able to help that guy. rm -rf / is easily fixable if you don't mess around with the disks afterwards. Backups usually have a recognizable format, so it's possible to recover the backups first and then restore everything from them.
...and this is why we include a 'backup technology' question in our technical interviews--where 'offsite backup' must be followed by something like "possibly the most important type of backup because..."
You know the sad thing is that even this isn't idiot-proof and needs to be qualified. One of my customer's brilliant "cost-saving" measures was to have an offsite backup solution that was basically an rsync script that ran every 15 minutes.
So when someone on their end did something catastrophic to their data and it took them an hour to notice, they were incredulous that we couldn't help them restore their data even though it was "backed up offsite!" because their "backup" solution had already caught up and duplicated the broken data.
And that's why if you're using rsync, you ought to be using rsnapshot instead, and have generations of backups so that you are not overwriting your most recent one.
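A minimal rsnapshot.conf sketch (paths invented; note that the real config file requires tabs, not spaces, between fields, and older versions spell 'retain' as 'interval'):

    snapshot_root   /backups/snapshots/
    retain  daily   7
    retain  weekly  4
    backup  /var/www/       localhost/

Cron then runs 'rsnapshot daily' and 'rsnapshot weekly', and you keep generations instead of one continuously overwritten mirror.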
>>the code had even deleted all of the backups that he had taken in case of catastrophe. Because the drives that were backing up the computers were mounted to it, the computer managed to wipe all of those, too.
If it's a duplicate copy of data intended for use in case of failure, then yes, it is a backup. It's not an offsite backup, but many people don't keep their personal backups offsite.
It is so deficient as a backup that I don't think it qualifies to be called a backup. That was my point.
Backups are expected to protect against data loss for a number of different failure cases (eg. disk failure, hardware fault leading to slow filesystem corruption, fire/theft, failed upgrade, "undo" for accidental change or deletion). There is a point where something addresses so few of these failure cases that you can't reasonably call it a backup.
Redundancy is there for fast recovery times (even zero downtime depending on how redundancy is implemented). It's not intended to run as a backup as redundancy devices are live and can fail from many of the same causes that will take your primary devices offline (fire, sysadmin fail, etc)
Likewise, if your "backups" are always online then it works better for business continuity than it does as a backup. So realistically it's more of a redundancy share.
If you're not doing offsite and cold backups, then you're just asking for trouble. If not crap like this, then a fire or a ransomware infection or a malicious employee, etc.
He actually was doing a remote backup (although probably not a cold backup). Unfortunately, he had used mount instead of rsync over ssh, making it vulnerable to the rm -rf command.
Just using rsync to make copies isn't a backup. If you use rsnapshot (which stores each copy separately) then you have a backup. Copies are not sufficient if you find out that something broke three weeks ago.
While, as others already pointed out, this story seems a little fishy, it serves well as a prompt to reflect on whether something like this could in theory happen to your infrastructure.
Do you have your backup servers in the same configuration management software (ansible, puppet, ssh-for-loop etc) as the rest of the servers? One grave error (however unlikely) in your base configuration really can take down everything together in one fell swoop.
How "cold" are your backups? If the backup media are not physically disconnected and secured, you can most likely construct a scenario where the above, malware, a hacker or a rouge admin could destroy both the backups and the live data.
I will certainly suggest some additional safeguards for our backups.
Yep, that's what I hope everyone will be doing... thinking about their own backups and infrastructure.
We have backups off-site on disconnected media, so that alone prevents the kind of accident we're talking about.
We use btrfs send / receive to send OS images from the primary container host to the backup container host. The snapshots are read-only, so I'm fairly sure I can't just 'rm -rf' them, I'd have to actually 'btrfs subvolume delete foobar' them.
I should try that though on one of the test servers...
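Roughly like this (paths and host name invented):

    # take a read-only snapshot of the container subvolume
    btrfs subvolume snapshot -r /srv/containers /srv/snapshots/containers-2016-04-15
    # ship it off; 'btrfs send' only accepts read-only snapshots
    btrfs send /srv/snapshots/containers-2016-04-15 | ssh backuphost btrfs receive /backups/containers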
Any script that includes rm -rf followed by variables in a path is an accident waiting to happen. Mounting the backup volumes is just icing on the cake for this extremely incompetent web hosting provider.
It made me nervous to type rm -rf in this comment form. Those letters are dark magic.
Maybe things have changed, but rm doesn't zero out the drive. And with the backup that was rm'd too, it should all be recoverable. Or am I missing something?
Not directly, no, but some filesystems give you a hard time recovering the file structure, which in some cases is a big problem.
You could probably recover files, but if the backups aren't stored in a tar/zip/... file, it will be hard to recover both the data and the structure.
Most of the data is probably still there on the drive. But the filesystem data that says where it all actually is stored is probably irreplaceably gone. If some of that can be recovered then it should be possible to recreate individual files. Without it someone would have to guestimate where all the files are and then maybe manually piece them together (a single file may be in fragments in different parts of the disk). They'd also have to differentiate old deleted versions of files from the most recent deleted versions.
So yeah, it could technically be recovered, but it's going to be a very big chore.
As part of my PhD research, I developed a shell scripting language (shill-lang.org, previously on HN: https://news.ycombinator.com/item?id=9328277) with features that provide safety belts against this sort of error. From speaking to administrators and developers, we believe these types of errors cause much more worry and cost much more time than they should.
Now that I'm graduating, we've started the process of refining Shill into a product that we can offer to administrators and developers to make their lives simpler. If this sounds like a tool you wish you had (or if you wish a similar tool existed for your platform of choice), we'd love to hear from you.
According to a comment in the ServerFault website, he actually managed to recover the data [1]. He consulted a company for data recovery and they gave him a list with the files that they could manage to save [2].
A company I left a while back recently had two servers accidentally rebooted through some sort of automated task (probably Puppet). The fine, I'm told, was one billion dollars.
It's a very large French bank. I don't know anything except that it was a paired batch processing server and didn't push further questions, honestly. For some reason, I found absolutely nothing in the news about it, but my source of information is credible. It doesn't surprise me for a second that it happened, since, for example, I spent months trying to get these guys to fix their literally useless MQ DR failover scripts, but nothing ever came from it, since, they didn't have anywhere to test.
With the way they treat their employees - fucking good riddance. There was a giant mess when Disney forced their NOC to train their replacements, but yet these guys did the exact same thing, plus some, and there was no public awareness during or after it. The best part was their push to move everyone to Montreal. Lower pay, not a guaranteed extension and you're forced to move? Okay.
The AMRS CTO actually left about a month after he got the position and took me along with one other person over to a new company. Goldman's head of tech actually just left to go to the same place. Not gonna lie, it sounds incredibly suspicious, especially considering the kinds of shenanigans that went on there... thankfully I'm no longer working there.
I once had an incident with a server which triggered notification alerts about a failing httpd service. While I was looking into the issue, the mail service suddenly stopped working, then the database service went down - it was like a slow cascading failure, affecting all services on the server one after the other. I finally noticed the 'rm' command in the process list and asked the client if he ran any custom commands as root on the machine. Turns out he followed the instructions on a website to install some custom software without checking any of the commands and just copied & pasted them into the prompt.
He basically managed to "rm -rf" on / and deleted his own server.
Luckily recent backups were available, so the damage was rather small, but it was interesting to see someone just pasting & executing commands without knowing what they actually do, especially when logged in as root.
You are incorrect. RAID 1 is a mirror setup. There are two drives with exactly the same information. One of the two drives is redundant. RAID 1 does not include striping and only requires 2 drives for redundancy.
I think what DDub is getting at is that there is no redundancy for the data received while the disk is mirrored to the new twin. For that, you'd need a mirrored pair plus a drive to yank out as the backup.
Instead of doing system administration as root, couldn't we have a system user with the same privileges as root except that it wouldn't have write access to the files of some users (like your clients)?
So you could still rm -rf / all you want, delete everything but still have /home or /var/www content untouched.
We run certain programs with limited privileges to mitigate risks (bugs, exploits, etc.), so why shouldn't we also limit the privileges of root to mitigate the risk of buggy system administration?
Obviously having actual backups and testing your code before applying it to production is good practice but I feel like doing system administration with root while having potential bugs in your sysadmin code (as in any other software) leaves the door open to the next catastrophic failure.
>Together, the code deleted everything on the computer, including Mr Masarla’s customers' websites, he wrote. Mr Masarla runs a web hosting company, which looks after the servers and internet connections on which the files for websites are stored.
And he has no backups? Including rolling backups in unconnected storage?
>Mr Marsala confirmed that the code had even deleted all of the backups that he had taken in case of catastrophe. Because the drives that were backing up the computers were mounted to it, the computer managed to wipe all of those, too.
Then the company probably deserved to die. Sorry for the customers though...
I managed to sudo chown -R {useless_user}:{useless_user} {foo}/ with foo undefined, whilst simultaneously distributing that command with dsh to our entire cluster of 10 machines. This was after testing that everything worked on the development machine. So of course, I retraced my steps to find out what went wrong, and killed the development machine too.
The upside is that we knew we had issues, and with everything broken the onus is on the right people to ensure they're fixed before we get distracted by the next shiny feature.
Sometimes, setting your servers on fire is the solution to technical debt.
More seriously, this isn't the first time I've heard of rm -rf backfiring - one of my friends said that at one place he worked, an IT guy walked out one day and never came back after trying to fix a co-worker's computer. My friend found out afterwards, by investigating the co-worker's computer, that the IT guy must have run rm -rf while root and wiped out everything.
I lost the private key to one of my AWS servers after it had had a traffic spike due to blog coverage[1]. It was a toy system so it was using local storage, but then it became sort of popular. Luckily I had a process monitor set up so it managed months of uptime before something happened that I couldn't do anything to fix.
I would like to point out that requiring `set -u` at the top of all your production bash scripts will prevent this kind of disaster - the script will fail if unassigned variables are referenced.
I never bothered to count exact numbers, but from my experience, close to two thirds of all people, when presented with a root shell and no consequences, will run rm -rf in some way.
Humble ones issue "rm -rf /usr" or "rm -rf /lib", others go straight to "/bin/rm -rf /". I've seen one person do "rm -rf /* ", immediately followed by "find / -delete". I'd really like to take a peek at his/her thought process at that moment; it looked like the desire for destruction was really strong in that one particular brain ;-)
So yeah, while it's not a particularly useful one, there's indeed a situation where one definitely wants to run it.
disclaimer: I run an SELinux playbox with free root access and session recording, and peeking into what others do is also fun.
Yeah so I typed rm -f * the other day after typing rm -f *~ repeatedly in a few different directories. In the 2 seconds it took me to realise, I lost a lot of data. First time I've made that particular typing slip-up in many years. Thankfully I had backups to restore from. Real heart-sink moment.
Sure, there should have been aliases for rm -i and I shouldn't have used -f etc etc etc. But sometimes this stuff is going to happen.
One take-away from this is that it's probably better to save your backups somewhere where you can't delete them. Make sure that nothing using rm touches your database backups. Also, try to keep them backed up in multiple places. For example, store backups on a server you own, and on a cloud server, like on S3.
Another way this could have been avoided is if he used "--one-file-system" flag, which wouldn't delete backups as they were mounted on a separate filesystem.
Nearly did the same thing once by messing up the ordering of flags. Thankfully this was before devops tools were present, so a ctrl-c stopped the wipe before it got too deep, but a Friday afternoon downtime is still bad.
Tape/Blu-ray disc backups can come in really handy in these cases, since they're not easy to wipe.
I guess the best course of action to prevent this would be to alias rm to a custom script, then parse the arguments to make sure the root directory is never recursively deleted, then call the real rm from within your script.
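Something like this, as a sketch (the wrapper name and the list of protected paths are invented, and it only catches the most obvious cases):

    #!/bin/sh
    # "saferm": refuse the most dangerous targets, then hand everything off to the real rm
    for arg in "$@"; do
        case "$arg" in
            /|/bin|/boot|/etc|/usr|/var|/home)
                echo "saferm: refusing to remove '$arg'" >&2
                exit 1 ;;
        esac
    done
    exec /bin/rm "$@"

One caveat: an alias only takes effect in interactive shells, so scripts that call rm will typically bypass it.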
From the article: "All servers got deleted and the offsite backups too because the remote storage was mounted just before by the same script (that is a backup maintenance script)."
This is what I thought. I would have to go out of my way to completely nuke the servers I work on. I'm trying to understand what structure this guy's company had if everything can be mistakenly deleted without any chance of recovery.
Maybe I misread the article and he runs a niche hosting company that has different requirements, but it seems strange to me to be able to completely remove your online body of work in a matter of minutes.
According to the thread they were able to recover almost all of the data so far. So the whole "deletes his entire company" no longer seems accurate. Still pretty crazy.
You can just alias rm to a script of yours that does just that with like, one extra line of bash. I've done this for a couple of commands where I prefer default behavior that isn't specifiable by flags.
Correct me if I'm wrong, but rm doesn't wipe data out, it just deallocates the disk space devoted to it. If you actually managed to wipe out your entire file system with rm you could likely still recover your data with a recovery tool.
It's just hard to get the filesystem entries for the file back. rm doesn't specifically wipe, you're right, but the filesystem entries are deleted, which means you basically have to grep the disk for bits of the file you want with known contents.