Hacker News
Man accidentally 'deletes his entire company' with one line of bad code (independent.co.uk)
233 points by bhartzer on April 14, 2016 | 269 comments


What's most epic about this is that it's in the UNIX Hater's Handbook. One of its rants was how better-designed systems would warn you if you were about to nuke your whole system. The reasoning was that a command to wipe the whole system was more likely a mistake than a developer's or admin's intent. UNIX would do it without blinking. Inherently unsafe programming and scripting combined with tools like that meant lots of UNIX boxes went kaput.

And today, over two decades later, a person just accidentally destroyed his entire company with one line without warnings on a UNIX. History repeats when its lessons aren't learned. This problem, like setuid, should've been eliminated by design fairly quickly after it was discovered.

http://esr.ibiblio.org/?p=538

EDIT: Added a link to ESR's review of the UNIX Hater's Handbook, which links to UHH itself. It nicely covers what was just rant, what was fixed, and what remains true. Linking in case people want to work on the latter; it also explains my sour relationship with UNIX. :)


One of the more fascinating aspects of human history is how much effort we're willing to devote to creating varied senses of safety, even if that safety is only an illusion.

Now, we can call this a failure of design, but really, people who rely on technology they don't understand can't be saved by good design. Sure, this particular case could be fixed by disallowing the recursive flag on the file system root, but safety can never be the primary design concern of any technological system.

Imagine if a sword were made with safety as a first-class concern. You can't design a sword that can be used safely by the untrained. No weapon can be, training with a weapon is a prerequisite for safely using it. Similarly, every technology has to be understood by those using it. If you don't, you're just inviting trouble.

For a business using technology, the needs are actually fairly straightforward. You need an understanding of what needs to be backed up, and a process for performing the backups. If you've picked the former right, (backing up human-readable information rather than data only readable by software programs that might go away in a crash) then risk is minimized.


This is a strange post. You imply that, just because it's impossible to absolutely prevent all kinds of disasters, no efforts toward safety should be taken at all.

By the same logic, you could strip away all the airbags, seatbelts, comfy seats and assistance systems of modern cars: After all, accidents still happen and safety mechanisms might even lure drivers into more reckless behavior. (This is in fact happening with seatbelts)

I think it's less useful to think about absolute safety than about which failures are likely to occur and how effective our measures against those specific failures are.

The --force flag is obviously not an effective measure against root deletions, otherwise we wouldn't have so many stories about it. My theory is that there are three reasons for it:

- As other people wrote, if you frequently batch-delete files, you get trained very quickly to always use -f, as plain rm is very annoying to use for large sets of files. Unlike other flags, -f won't make you stop and think. This could be fixed by making rm-without-f actually usable - for example by only asking once and not for every file like, oh I don't know... Windows.

- rm can interact with shell parsing in very opaque and fatal ways. My guess is that most root deletions happen like in this post: not a literal rm -rf / but some unfortunate variable interpolation where the author didn't realize that it can evaluate to "/" (a sketch follows this list). That's a very non-obvious point of failure that takes a lot longer to learn than just using rm. Therefore rm should absolutely warn about it.

- there is actually an expectation that rm could be safe, as most deletes you do on a modern system are reversible - either because you have a "recycle bin" or a backup. So a warning would make sense to counter that expectation.
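
To make the shell-parsing point concrete, the classic failure looks roughly like this (variable names made up); note that the glob form sidesteps rm's root protection entirely:

    dir="$base/$name"   # with both variables unset, this is just "/"
    rm -rf "$dir"       # newer rm refuses a bare "/" ...
    rm -rf "$dir"/*     # ...but this globs to every top-level directory and nothing stops it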


'vinceguidry actually makes a pretty good point. It's one thing to cover potential stupid mistakes with safety features. But beyond a certain point, safety starts to oppose utility - i.e. a perfectly safe car would be a simple chair. Perfectly safe software is likewise software that is totally useless for anything.

It's important to consider when designing software that safety should be about gracefully handling mistakes, and not something that should lure the user into a false sense of not having to know what they're doing. Unfortunately, the latter attitude is what drives today's UX patterns and software design in general, which is a big part of why tech-illiterate people remain tech-illiterate, and modern programs and devices are mostly shiny toys, not actual tools.


It's true that safety and also security can impair the usefulness of something past a certain point. It's also irrelevant to our current topic given the existence of systems that don't self-nuke easily. This is a UNIX-specific problem that they've fought to keep for over 20 years with admittedly some improvement. There were alternatives, both UNIX setups and non-UNIX OS's, that protected critical files or kept backups from being deleted [at all] without very specific action from an administrator. And nobody complained that they couldn't get work done on or maintain a VMS box.

So, this isn't some theoretical, abstract, extreme thing some are making it out to be. It's a situation where there's a number of ways to handle a few routine tasks with inherent risk. Some OS's chose safer methods, with unsafe methods available where absolutely necessary. UNIX decided on unsafe all around. Many UNIX boxes were lost as a result whereas the alternatives rarely were. It wasn't a necessity: merely an avoidable design decision.


I'm glad we have the same opinion then - as I said, it's not very useful to reason about "perfect safety".

It's certainly possible to make a product "safer" than necessary and hinder utility (though I think "safety" is the wrong concept to look at here - see below), but if the common opinion of your product from tech-illiterate people is "complicated and scary", I think you can be pretty sure that you are still a long way away from that point.

In fact, some versions of rm do add additional protection against root deletions, e.g. by requiring the --no-preserve-root flag. What utility did that requirement destroy?

I believe if you really want to make people more tech-literate (which today's apps are doing a horrible job of, I agree), you have to give them an honest and consistent view of their system, yes. But you also have to design the system such that they can learn and experiment as safely as possible and can quickly deduce what a certain action would do before they do it.

Cryptic commands, which are only understandable after extensive study of documentation, and which oh by the way become deadly in very specific circumstances don't help at all here.


"Cryptic commands, which are only understandable after extensive study of documentation, and which oh by the way become deadly in very specific circumstances don't help at all here."

Exactly. That's another problem that was repeatedly mentioned in the UNIX Hater's Handbook. It still exists. Fortunately, there are distros improving on aspects of organization, configuration, command shells, and so on. I'm particularly impressed with NixOS doing simple things that should've been done a long time ago.


> You imply that, just because it's impossible to absolutely prevent all kinds of disasters, no efforts toward safety should be taken at all.

Not at all. We should absolutely work to make things safer. But we need to be realistic and temper our sense of idealism. Nothing was going to save this guy from disaster; if it wasn't 'rm', it just would have been something else.

My point is that you can't expect safety features to obviate the need to know what you're doing.


This could've stopped it unless he specifically coded it to destroy system files:

https://launchpad.net/safe-rm


Sounds very fatalistic. All people killed in car accidents would also have died from something else?


Many probably would - it's likely that an insanely reckless attitude towards motor vehicles will also be reflected in other areas of life.


I don't know exactly how you'd plan on fighting reality. The fact that this guy was going to get hosed eventually isn't some justification, it's a fact. People who do stupid things get burned.

If someone wants safety features, let them pay for them. If someone wants to add one, sure, so long as I can remove it if it gets in my way. Who knows, maybe they'll actually be worth having. But I'm not going to lose sleep over every idiot who ruins his life over something he didn't or couldn't learn about. There's absolutely nothing you can do to save stupid people from making stupid decisions.

Maybe I remove a safety feature I don't need and hurt myself with it. Now I'm the moron. Hopefully I learn from it. Nothing you could have done about that either.

Show me something foolproof, and I'll show you a greater fool.


In the UNIX Hater's Handbook, defenders of rm consider accidental deletion a "rite of passage" and remark that "any decent systems administrator should be doing regular backups" (see page 62). The author's response is funny:

“A rite of passage”? In no other industry could a manufacturer take such a cavalier attitude toward a faulty product. “But your honor, the exploding gas tank was just a rite of passage.” “Ladies and gentlemen of the jury, we will prove that the damage caused by the failure of the safety catch on our chainsaw was just a rite of passage for its users.” “May it please the court, we will show that getting bilked of their life savings by Mr. Keating was just a rite of passage for those retirees.” Right.

I'm surprised how relevant parts of this book are 22 years later.

http://www.vbcf.ac.at/fileadmin/user_upload/BioComp/training...


Is there an alternative to allow scripted destructive actions without the risk of deleting important stuff?

Modern OS's will warn you if you try to delete stuff, but you can still ultimately do it anyway; I don't see it as something particular to UNIX.

The only similar problem I had was on Windows, 98 I guess, where I deleted all my files that weren't read-only by fiddling with a .bat script.


Have an immutable filesystem, where "deletes" are recoverable by going back in time. At least until you do a scheduled "actual delete" that will reclaim disk space.
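
ZFS and btrfs snapshots give you pretty much this today; a rough sketch with ZFS (pool/dataset names made up):

    zfs snapshot tank/home@before-cleanup    # cheap, read-only point-in-time copy
    rm -rf /tank/home/some-project           # oops
    zfs rollback tank/home@before-cleanup    # the files come back
    zfs destroy tank/home@before-cleanup     # the scheduled "actual delete" that finally reclaims the space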

Another option (though last time I tried it, it didn't work..) is something like libtrash: http://pages.stern.nyu.edu/~marriaga/software/libtrash/ Deletes become moves and you can really delete when you like.

Practically speaking, if you're quick, an 'rm' isn't totally destructive even without backups. There's a good chance your data is still there on the disk; it's just not associated with anything, so it could be overwritten at any point. Best to mount the disk read only and crawl through the raw bits to find your lost data (I recovered a week's worth of code this way several years ago).


My favorite answer to the common interview question: "What was your biggest mistake, and how did you recover from it?" Answer: Back in 1993, I once deleted a critical data file. Fortunately, the AIX host was sitting next to me, so I quickly reached over and flipped the power switch off. The strategy being: writes were buffered and flushed out periodically, so hitting the power switch prevented that last write from hitting the disk. And if this didn't work (and caused more file system corruption), well I would have needed to restore from backup anyway.


That's great. Straight out of an NCIS episode except it actually makes sense this time. :)


> At least until you do a scheduled "actual delete" that will reclaim disk space.

And then your "actual delete" is where the data loss occurs :D


Right, but if you delete your entire file system there won't be anything to come along and do the "actual delete", so you're safe until someone comes along with a rescue disk or otherwise mounts it to a system that knows how to deal with this.

At the very least when you rm important-file.txt instead of importanr-file.txt you have a chance.


A pre-delete would already hide files from apps and services so they "fail fast", and the actual delete would only happen after something like "it's been running fine for two days". Of course this implies that actively open files should not be pre-deleted on unix at all (at least not by the rm process). Even if you delete the entire filesystem, backups included, there would be a chance to boot into recovery mode and undelete everything. We can even go further and apply small-file versioning at the fs level to prevent misconfiguration accidents in /etc.

That's very simple and powerful; I can't tell why it still isn't implemented today.


Well, there are some safeguard checks you can do to prevent the easy mistakes, like requiring an additional flag for rm -rf-ing everything in the root or /home/*/, or across device boundaries as someone said.

You can redesign rm so people don't find themselves typing -rf as force of habit.

You can have multiple delete commands: mark as overwritable if needed and remove from 'ls -a', actually delete, overwrite sectors with zeros; like we have in GUIs.

You can have a permission system that isn't just "this account can do literally anything" or "this account can't do anything"


here's a dumb idea off the top of my head: /etc/rmblacklist.conf, autopopulated (by the distro) with a list of files (/boot and the actual boot image for instance) that require a GNU-style long option --nuke to delete. It's still easily scriptable, but you'd rarely ever actually need it, and requiring a long option would serve as a double check that you meant it when you did.

Sure it's still possible to --nuke something due to a bug or negligence but I bet it'd cut down on fatal errors. Plus the user could have their own ~/.rmblacklist.conf to guard against particularly persistent dyslexias.

You could even erase the contents entirely if you really think being able to delete root on a whim keeps your kung fu strong.


Good to see you fighting the fight with many sensible ideas. I've been posting this one in most responses as it's pretty simple:

https://launchpad.net/safe-rm


Hah! Beat me to the punch by 5 years.


Darn, I didn't notice it was that old. Makes the situation even worse for UNIX rm defenders. Like when Trusted Xenix eliminated setuid vulns mostly by clearing the setuid bit during a write, with the admin having to manually reset it. Simple shit. Mainstream response? "Just audit all your apps for setuid and be extra careful in..." (sighs)


You could just make the file immutable and read only:

http://paste.click/qMpmyO

Then remove the immutable flag if needed
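
(In case the paste is unreachable: I assume it's showing the chattr immutable flag, which on ext*/XFS works roughly like this; file name made up.)

    chattr +i /backups/weekly.tar    # set the immutable attribute (needs root)
    rm -f /backups/weekly.tar        # fails with "Operation not permitted", even as root
    chattr -i /backups/weekly.tar    # lift the flag when you genuinely mean to delete it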


This isn't a bad idea.


https://launchpad.net/safe-rm

Use something like that with scripting. It really is that easy to protect critical stuff while keeping an otherwise easy-to-use rm. UNIX has always been resistant to doing stuff like this. However, its architecture makes it fairly easy for developers to do it themselves. That's to its credit.
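
If I recall its setup correctly, it's just a wrapper around rm plus a blacklist of protected paths, one per line, in something like /etc/safe-rm.conf; alias or symlink it over rm and your scripts pick up the protection without changes:

    # /etc/safe-rm.conf -- paths safe-rm will refuse to delete
    /
    /etc
    /usr
    /home
    /backups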


There's no reason the Recycle Bin couldn't work for scripting languages, except that lazy OS and language developers haven't bothered to wire it up.


That attitude was pervasive in both UNIX and C. That other systems avoided common problems of both with little work or penalty shows it's unnecessary today. Yet, the results of that attitude continue to do damage.


The author stretches his analogies substantially. It's nice to see this was happening 22 years ago just like it is nice to see UNIX from more than 22 years ago still being used today.

Edit: i.e. it isn't so nice. Not much progress.


Yeah, I saw the same problem in the NSA backdoor debate, where people liken it to brake failures and such. I pointed out repeatedly that computer breaches rarely maim or kill anyone. They usually don't even cost people their jobs or bankrupt their businesses. Such analogies are strawmen to try to boost their argument with an emotional response.


>You can't design a sword that can be used safely by the untrained. No weapon can be, training with a weapon is a prerequisite for safely using it.

I was a tank crewman in the US military, and there is a high chance of death or dismemberment from regular tank operation. Drivers have very limited visibility, tank turrets can pin people as they traverse, the breech of the main gun violently recoils into the crew compartment with limited guards, the rounds can be exploded by static electricity (or just plain lit on fire), I saw one guy smash all his teeth out when riding inside as the tank hit a hole, and so on.

We had a saying: "tanks are designed to kill, they don't care who."


I'm pretty sure my computer is not designed to destroy businesses though.


Your computer is designed to do exactly what it was told, at ludicrous speed. Much like the tank analogy, this is the very essence of the thing.


Then why does rm have the f flag in the first place? Clearly someone thought oh-hay-guys maybe a safeguard wouldn't be out of place. They just designed a really awful safety.


For the last 1000 years, swords have typically been constructed with a guard. It's a simple and useful safety feature, and it almost never prevents an intended use. Same for disallowing '/' as the target of rm.


It's worth noting that since 2006 the default behavior for "rm -rf /" is to exit immediately with warnings.

http://superuser.com/questions/542978/is-it-possible-to-remo...


I don't want to sound fussy, but sword guards aren't designed to protect the user from the sword; they're for protecting the user from other swords. That they make using a sword safer is a side-effect. I agree with the point you're making though.


After every disaster, you can come up with a process that would have stopped that specific disaster. That doesn't mean it's a good idea to implement that process everywhere.


And yet swords have guards.

If a disaster recurs repeatedly -- and if fixing it costs essentially nothing -- then it should be fixed.


It's too bad you're being downvoted. It was not too long ago that people blamed the Germanwings Flight 9525 crash on how the cockpit door was designed to protect against hijackers, post 9/11:

http://www.nytimes.com/interactive/2015/03/26/world/europe/g...


And after every disaster that occurs again and again you might increase your interest in averting the next disaster.


> Imagine if a sword were made with safety as a first-class concern.

This seems like a silly example. A weapon, meant to dismember and maim attackers of its owner, is one of those things that's impossible to make completely safe. Granted, I could think of plenty of ideas that might make it somewhat safer without compromising its effectiveness as a weapon, but it's simply not an apt example to use here.

A computer is a general purpose device. It can be used to help image cancer, launch a nuclear weapon or play games. Considering that it's meant to be used by everyone, without discrimination, it seems to make sense that you need to do the best you can to protect the user from themselves.

I worked in Apple Care support for about a year. The majority of your users are not going to know all of the consequences of their actions, even ones doing system administration (because let's face it, almost every company in the world needs at least a little of that now and not all of them are going to hire someone who knows what they're doing).

You can't protect a user from everything. But when you can protect a user from doing something that would have screwed up their whole system, lost a project, etc? That's helpful. Correcting input is what computers are essentially there for.


What you're essentially saying is:

> It's impossible to have 100% safety, so let's not bother placing any importance on safety.


Interesting points. I think systems like Burroughs counter that concept, in that a lot of safety can be baked into a system. Here's what they did in 1961:

http://www.smecc.org/The%20Architecture%20%20of%20the%20Burr...

Notice that's good UI design for the time, hardware elimination of the worst problems, interface checks on functions, limits on what apps can do to the system, and plenty of recovery. Systems like NonStop, KeyKOS, OpenVMS, XTS-400, JX, and so on added to these ideas. You can certainly bake a strong foundation of safety into a system while allowing plenty of flexibility.

So, for example, critical files should be write-protected except for use by specific software involved in updates or administrative action. Many of the above systems did that. Then, one can use a VMS-style, versioned filesystem that leaves originals in place in case a rollback is needed, so long as there's free space for that. Such a system handling backups and restores with modern-sized HD's wouldn't have nuked everything. Might have even left everything intact if using a lean setup, but I can't say for this specific case.

"You can't design a sword that can be used safely by the untrained."

A sword is designed to do damage. A better example would be a saw that's designed to be constructive but with risk of cutting your hand off. Even that can be designed to minimize risk to user.

https://www.youtube.com/watch?v=esnQwVZOrUU

"If you've picked the former right, (backing up human-readable information rather than data only readable by software programs that might go away in a crash) then risk is minimized."

That's orthogonal. A machine-readable format just needs a program to read it. The risk is whether the data is actually there in whole or part. This leads to mechanisms like append-only storage or periodic, read-only backups that ensure it's there. Or these clustered, replicated filesystems on machines with RAID arrays that lots of HPC or cloud applications use. Also, multiple, geographical locations for the data.

People doing the above with proven protocols/tools rarely lose their data. Then there's this guy.


Table saws should never be used on flesh. rm(1) should always be used on files. How in FSM's noodly universe is the command supposed to intuit which files it should safely delete versus those it shouldn't?

> ...or administrative action.

You mean like, "sudo rm -rf {$undefined_value}/{$other_undefined_value}"? D'oh!


Two different people here have already figured out this wouldn't have happened in OpenVMS due to its versioned filesystem with rollback. People also claim it had saner commands for this stuff, but I can't recall if its remove was smarter.

Anyway, pertaining to RM, here you go:

https://launchpad.net/safe-rm


Make `--one-file-system` the default!


He really should not have made the first element of the path a variable. Doing an "rm -rf /folder/{$undefined_value}/{$other_undefined_value}" would have made his day much better.

Also, never having all backup disk volumes mounted at the same time is good practice.


There's also the phenomenon that people have an inherent tolerance of risk, so the "safer" you make something, the more reckless people tend to be.

When traction control and antilock brakes became mainstream, one result was that some people started driving faster on snowy roads, up until their risk tolerance was the same as before.

If you understand that a typo can destroy your business, you'll be careful to not log in as "root" on a routine basis and double check everything you do and keep good backups. On the other hand if you expect the system to prevent you from doing anything really damaging, you might be more careless about your approach.


That's great. People got to their destinations faster with the same level of risk as before.


That theory is called Risk Compensation: https://en.wikipedia.org/wiki/Risk_compensation


> You can't design a sword that can be used safely by the untrained. No weapon can be

This is a really weird argument, since weapons are used in combat, which is not 'safe' by definition.

But if you do want a weapon that the untrained can use without much chance of hurting themselves, look to a spear. It was the go-to weapon for untrained militias from the time history began up to gunpowder taking over - and even then, bayonets are stuck on rifles to turn them into spears.


Swords also don't run code. Just perhaps an important difference.


A counter that applies to almost every comparison the opposition brings up. Further, swords don't have an easy solution to stop problems that fits in a comment or two in this same thread. ;)


Hmm... A self-wielding sword running Android sounds like a great idea.


  path=$foo/$bar

  # anchors matter: without ^ and $ this would match any path containing a slash
  if [[ $path =~ ^[[:space:]]*/[[:space:]]*$ ]]; then
      echo NOPE
  else
      rm -rf "$path"
  fi


Or you could use VMS.


BOOM! That's two of us that noticed an OS with the right combo of features to avoid this kind of crap.


    $ sudo rm -rf /
    rm: it is dangerous to operate recursively on ‘/’
    rm: use --no-preserve-root to override this failsafe
Relatively modern distro, but this has been in coreutils for a while (it was fixed in Ubuntu's coreutils in 2008 for instance)

    $ egrep '^(NAME|VERSION)=' /etc/os-release
    NAME="Red Hat Enterprise Linux Workstation"
    VERSION="7.2 (Maipo)"

Yes, I just did this on the workstation I'm typing on. I'm somewhat curious if he did that on absolutely ancient distros (>8 year old), or didn't actually run rm -rf, but the ansible python equivalent.


Someone brought that to my attention on Schneier's blog suggesting the post was trolling. I'm holding off on going that far for now as some details might be missing & this sort of thing has happened many times. I don't use Ansible. Does it have the --no-preserve-root or other modifiers necessary on modern distros already in it?


I feel like if you use `rm -rf` particularly `--force` with privileges, it shouldn't be the job of Unix to stop you.

Also tangentially, if you don't have a sensible backup in place that would protect you from (or at least mitigate) a complete wipe of a single machine (or even all primary ones), you are doing something wrong.


The problem with that is people are trained to use -f because it's so annoying to try and use rm without it.

Really, the -f flag should just mean "don't ask for confirmation" and a separate flag should be required to mean something like "yes I do want to nuke my computer". And maybe there should be a flag that means "cross device boundaries", and by default it could refuse to delete anything that has a device number different than the argument it started with. That would at least prevent you from nuking your network-attached storage.


This is probably a far better solution than trying to make rm smarter about which kinds of files it's somehow either "safe" or "unsafe" to unlink. Even then, though, you'd still get people who invoke it with "--across-devices --yes-i-really-mean-it" set, with unexpected and disastrous consequences.

And then someone will come along and bitch that rm isn't safe enough yet again.


Well, the point is that --across-devices won't be a flag you use as a matter of habit, it's something you'll only use when rm tells you it won't cross the device boundary and you realize that, yes, you really do want to delete across a device boundary. So you'll only add it in the specific cases where it's warranted, and you'll have already tried rm without it (to verify that you're deleting the right thing).

Come to think of it, I don't think I've ever used rm to delete across device boundaries. It just doesn't seem like an action you usually want to take.


... so then you change it to be more safe yet again.

I don't understand this attitude. Of course software isn't perfect; it's not even close, it's pretty awful. But the best thing about it: it's malleable. When things don't work, you change them to work better.


When things don't work, you change them to work better.

I'd submit that adding layer upon layer of complexity to prevent all the myriad stupid things people might do using a particular piece of software isn't axiomatically "better".

Maybe if lives depend on its correct function, it's worth it, but that kind of strict requirements gathering and execution is well-understood by the people who live in that world.

Making sure that J. Random DevOps Dude doesn't foot-gun himself when he's paid to know better isn't that.


But J. Random DevOps doesn't foot-gun himself; he foot-guns the whole Op. You can fire him because it turns out he didn't know better, but it's not going to fix the problem he created, the problem now affecting the whole company.


Believe me, I know this. Entirely too well, in fact.

At my last job, the senior DevOps dude foot-gunned the entire company, by running a read-write disk benchmark (using fio(1)) against the block device (instead of a partition, which, while still stupid, would at least not have been actively destructive) on both my master and all of my slave PostgreSQL hosts. At the same time. And, of course, without telling anyone what he was doing, so the first inkling I had that there was a problem was about 20 minutes later, when I started getting a steadily increasing number of errors suggesting disk corruption.

How does one make such a tool drool-proof enough to prevent that kind of idiocy? Please, help me figure that out. And then give me a time machine, because that was a 16-hour day I'd really rather not have experienced.

And, no, the right move is generally not to fire the jackass who makes that kind of mistake. In my case, above, the company spent about three quarters of a million dollars (just in revenue, never mind how much time was burned in meetings about the incident, my efforts to fix the problem, as well as his and the rest of his team's efforts, and so on) teaching him never to do that again. You don't buy lessons that expensively and then let someone else benefit from them.

(That said, he did get fired several months later for telling the entire engineering lead team to fuck off, in so many words, for their having made a perfectly reasonable request, which was entirely within his responsibilities, and his skills, to satisfy.)


Well, there is a "cross device boundaries" flag, but it's not enabled by default:

"--one-file-system

when removing a hierarchy recursively, skip any directory that is on a file system different from that of the corresponding command line argument"
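
For example, if a backup volume happens to be mounted somewhere under the tree being cleaned (paths made up), the flag keeps rm from descending into it:

    rm -rf --one-file-system /srv/staging/
    # /srv/staging/backup is a separate mount, so it gets skipped instead of wiped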


You should not use it the first time; verify that you actually need it first. That second's pause may save your data :) If you're feeling rash, get up and walk around until it passes.


Bingo. See also, "the drill story":

When I was in seventh grade, I took an Industrial Arts class ("woodshop"). The first few weeks of the course were spent going over safety of the machinery. In particular, I remember a heavy-handed message that Mr. Hopfer gave at the drill press:

This is a piece of industrial machinery. It is not a toy. If you put your hand on the stage and lower the bit, the machine will not jam up and make funny noises because it is too difficult. Instead, it will drill a hole through your hand. That is what makes it useful. If it didn't do that, you wouldn't be able to cut through wood.


There are actually ways to make it so it will in fact stop on contact with one's hand (for saws, but you could imagine an analogous system for a drill press):

http://www.sawstop.com/why-sawstop/the-technology

Just because people should use something responsibly doesn't mean one shouldn't try to improve its inherent safety.


That will also apparently stop when in contact with wood that's recently been cut down or is damp, destroying the (expensive) brake and blade in the process. It even has a bypass mode for that reason.


SawStop cartridges are ~$70 USD. Not very expensive -- your hands are worth far more!

http://www.amazon.com/SawStop-TSBC-10R2-Cartridge-10-Inch-Bl...


Way more with tens of thousands of accidents annually:

http://www.fairwarning.org/2013/05/after-more-than-a-decade-...


That will also stop for wet wood, which is annoying.


Luckily for us, computers are slightly more sophisticated than drills and we can incorporate sane safety checks with relative ease. In this case people are just asking for an explicit flag in the rare situation when they do want to delete everything.


I really hate this type of attitude. Just because .0001% of users want to do something, doesn't mean the other 99.9999% need to suffer for it.

Are you against aircraft collision avoidance too? If your pilot wants to fly into another plane, then the guidance system shouldn't try to stop him, right?


> I feel like if you use `rm -rf` particularly `--force` with privileges, it shouldn't be the job of Unix to stop you.

I agree, that's what's so good about Unix and GNU/Linux: the freedom of root to do anything, even make major mistakes.


`rm -r` does give warnings. You have to intentionally turn them off with the -f flag. It sounds like the -f has become too standard, a default flag, which undermines its entire purpose. You should only be using -f if you're absolutely 100% sure what you're doing, and that clearly wasn't the case here.


A question, I checked the manpage for -f (I don't really ever use it on purpose, rm seems to work fine for most of my file deleting needs).

It says "ignore nonexistent files and arguments, never prompt".

Seems to me that the "never prompt" behaviour is an important requirement for a backup-script, though? Cause backup-scripts should work unattended, and definitely not pause and wait for input under any circumstances, right?


True, but backup scripts that make use of rm -rf should have been really thoroughly tested, and contain a check that they are really only about to delete the thing they're supposed to delete.

Also, perhaps they shouldn't be running as root. There's no reason why any script should have permission to write everywhere. It just needs write access on the backup device.
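
Such a check doesn't have to be elaborate; even something like this (variable and path names made up) catches the empty-variable case:

    backup_root="${BACKUP_ROOT:?BACKUP_ROOT is not set}"   # abort right here if unset or empty
    target="$backup_root/daily"
    case "$target" in
        /srv/backups/*) rm -rf -- "$target" ;;
        *) echo "refusing to delete unexpected path: $target" >&2; exit 1 ;;
    esac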


He destroyed his company not with one line. He destroyed his company with extremely wrong setup. That one line just nailed it.


He destroyed his company with a series of poor decisions. Running "rm -rf /" was just the last one. And it was the last one because it successfully destroyed the company.


Yep, it was an improper backup and recovery system that did him in. I'm sure many other businesses have been destroyed by the same, whether or not it was an rm -rf that was the trigger.


Good point. It was a combination of things.


It's quite odd that his version of rm didn't require --no-preserve-root ... or else he has that turned on by default for some bizarre reason.


Also in the case of programs running rm -rf.

Like what happened with the Linux version of Steam: https://github.com/valvesoftware/steam-for-linux/issues/3671


The UNIX Hater's handbook was written a long time ago. At least on Linux, rm -rf / does not work unless you also pass --no-preserve-root.


Not sure I agree. My grandpa had a saying: "It's impossible to make things fool-proof because fools are so ingenious." No matter how many safety checks you add to a system, someone will find a way to fuck it up.


That is not an excuse for letting sloppiness of thought introduce risk.

https://en.wikipedia.org/wiki/Poka-yoke

Poka-yoke (ポカヨケ) is a Japanese term that means "mistake-proofing". A poka-yoke is any mechanism in a lean manufacturing process that helps an equipment operator avoid (yokeru) mistakes (poka). Its purpose is to eliminate product defects by preventing, correcting, or drawing attention to human errors as they occur. The concept was formalised, and the term adopted, by Shigeo Shingo as part of the Toyota Production System. It was originally described as baka-yoke, but as this means "fool-proofing" (or "idiot-proofing") the name was changed to the milder poka-yoke.


The link to The UHH in that article appears to be broken, here's an archived copy:

https://web.archive.org/web/20120213211126/http://m.simson.n...


rm -RF didn't destroy this guy's business; failure to maintain and adhere to an adequate backup policy did.


There are two issues with this:

1. -f specifically forces the change - it's an "I know what I'm doing, don't warn me" option

2. He was running this in a script, and automating a process so he didn't want a warning


Those are good observations. Both true. xg15 addressed some of the reasons why this is still a problem:

https://news.ycombinator.com/item?id=11499679

Here's an example of a simple alternative that lets you do what this person is doing while avoiding unnecessary hits to critical files:

https://launchpad.net/safe-rm


Cool script! Thanks for that :-)


This question that went unanswered in the replies bears repeating:

Any idea why the command actually ran? If $foo and $bar were both undefined, rm -rf / should have errored out with the --no-preserve-root message.

The only way I can think of that this would have actually worked on a CentOS7 machine is if $bar evaluated to *, so what was run was rm -rf /*.

As the above notes, I'm pretty sure recent versions of Redhat/CentOS actually protect against this sort of thing.

On the off chance you're not running a recent server, however, this could also be avoided by using `set -u` in the bash script, as it would cause undefined variables to error out.
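
For reference, that's just (variable names made up):

    #!/bin/bash
    set -u                    # treat references to unset variables as errors
    rm -rf "/srv/$app/$ver"   # with app or ver unset, bash aborts with "unbound variable" before rm ever runs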


I believe those variables were not handled by the shell but rather in an Ansible "playbook" - see http://docs.ansible.com/ansible/playbooks_variables.html

i.e. the variables were being handled in a Jinja template and, because they were undefined, rm -rf {foo}/{bar} was transformed by the template engine into rm -rf /


By default, ansible errors out for undefined variables.

http://docs.ansible.com/ansible/playbooks_filters.html#defau...


the playbook will fail if there are undefined variables, so I find the story suspect.


Is there a version of ansible that doesn't have this behavior?


Or perhaps the variables were defined like

     foo = ""
Or were set via some function that could return a null

     bar = getValueOrNull()


Another lesson to be learned is that it's exceptionally bad practice to use Ansible to push out shell scripts that can be handled by native Ansible modules: http://docs.ansible.com/ansible/file_module.html


I feel like there is nobody replying who tested this theory because their computers don't work anymore.

It is a tragic story but rm -rf has been almost a joke in the industry for a very long while now. Even really old systems should have received an update of some form, to such an extent that the story in the op would be ridiculous rather than a discussion topic.

When I use the command I need to block out all distractions. I check my surroundings for things which might fall on my keyboard. I borderline make sure my phone is turned off before I carefully begin typing that.

I feel uncomfortable typing it into hacker news anywhere but the middle of a sentence. I can't imagine the bullets I would be sweating while deploying a bash script to all servers that included it. There is a problem that needs to be addressed.


Before you apply `set -u` to all of your scripts, be aware that an empty array counts as undefined. So if you have arrays for which being empty is valid, be sure to `set +u` right before accessing them (and then `set -u` again after).
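
A quick illustration of the gotcha (it applies to bash versions before 4.4, which fixed it):

    set -u
    files=()
    echo "${files[@]}"   # bash < 4.4: "files[@]: unbound variable"
    set +u
    echo "${files[@]}"   # fine; expands to nothing
    set -u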


Or write ${arr+"${arr[@]}"}, which is, you know, only slightly more awkward than bash arrays already were to begin with.

Really you just want to use a real programming language instead.


This is why you should never run rm -r with the -f switch the first time. -f says yes to all warnings, including the ones that confirm that you really mean to erase root. I see this constantly in SO and other answers, and it's a really really bad practice to do it without thinking.


This was inside a shellscript though. It could well be that the script e.g. needed to remove write-protected files, so the flag really had to be there.


Even in a script, would you really want it not to prompt you or fail under those conditions? Files are usually write-protected for a reason.


I wish git was better about having a deletable directory. Most of the time that I need to use the -f switch, it's because I get this trying to remove a .git directory:

    rm: remove write-protected regular file?


I'm a bit surprised the newspaper did not validate the source? They're basically quoting a Super User / Stack Exchange thread which was probably a troll.

If a hosting company had deleted 1535 client accounts, we would have heard other stories about it from angry clients?


It's kinda strange that major "mainstream" news publications are publishing articles with web board posts as the primary source, indeed, the entire story itself, with basically zero extra work. They could at least try to contact the guy, interview him, make sure he at least seems legit.


That's not too surprising, unfortunately. With the internet and the obsession with getting news out as quickly as possible (because hey, being the first to report something gives you a lot more backlinks and clicks), the standard of proof for a story has gone from 'a lot of evidence gathered through actual investigation' to 'someone said this on an internet forum somewhere'. Basically, the internet and social media rewards quick reporting, not accurate reporting or verification.

It's still better than a lot of gaming news sites though, where 'some guy mentioned something on Twitter/Reddit/4chan' is suddenly front page news within ten minutes.


Doubtful. The set of people willing to do business with a thumb-fingered goober like this one is likely disjoint with the set of HN readers.


There's a specific reason why some called it fake: as advised by others, Marsala ran the dd command to save the raw content of the disks, in case the recovery process screwed up somehow.

He inverted the `if` and `of` arguments. You'd expect him to pay attention, after what happened. This doesn't pass the smell test for some.

Then again, you'd also expect him to be quite stressed out. That does make that mistake a bit more likely.


Considering that, if this isn't a troll, he wiped out his entire business with a single command, I think you're giving him too much credit here.


Possibly, but the dd command does have more affordances than say, ld. `if` and `of` do stand for input file and output file, and it's harder to swap named arguments than positional arguments.

He could have swapped arguments like `/dev/sdc` and `/dev/sde` though…


That is fairly probable, in fact. See that person's earlier question about block device naming.

* http://serverfault.com/q/728725


The quotes in the article definitely do not match the diction you typically hear in an SE thread. SE users typically chime in with helpful, logic-driven advice; the advice they gave this unfortunate person sounded more like 4chan or reddit: "You're screwed!!"


Read the original for yourself, rather than simply guessing from what you think typically happens. The quotes match what was written:

* http://serverfault.com/questions/769357/recovering-from-a-rm...

* http://serverfault.com/questions/769357/recovering-from-a-rm...

Both of those people are ServerFault diamond moderators.


Wouldn't have taken too much effort to confirm or deny the story.

The poster's SO profile is here: http://serverfault.com/users/251721/bleemboy

That lists twitter and github accounts. The github account lists the website of: http://www.thenetworksolution.it/

Which is an Italian provider of web design and hosting services. There is a phone number on the page.


I wonder if there are real cases of StackExchange questions similar to this one.


If it was 31337 accounts, it would be super-troll territory.


Or 1507.


This felt a bit unlikely, but what really convinced me it was a troll was a follow-up comment where he said he accidentally switched "if" and "of" in a dd command.


Other giveaways:

* "more or less 1535 customers"

* ansible would fail if variables are unset (though they could be empty)

* no mention of --no-preserve-root

* who mounts an off-site backup, instead of pushing with scp/curl/whatever

* googling his name does not dig up any company


If he mounted his backup media and wiped it, what makes you think he couldn't cockup the dd command?

However, under Linux rm -rf / needs --no-preserve-root to work, right?


Depends on which version of `rm` you have. Newer versions (rm from coreutils 8.xx I believe) use no-preserve-root


And the question is tagged centos 7, which is from 2014 - far too recent to not have --no-preserve-root.


https://git.centos.org/log/rpms!coreutils.git/refs!heads!c7 indicates that CentOS 7 has coreutils 8.22 . http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commit... shows that GNU coreutils gained that option in 2003.


It's distribution dependent.


Yes. To wit:

"I swapped if and of while doing dd. What to do now?" – Marco Marsala Apr 11 at 7:02


Does this article do anything other than paraphrase the Serverfault thread? On the first reading I thought they had contacted the poor fellow to confirm his identity or something, but rereading, it would appear that they didn't: no further information than the original source.

https://serverfault.com/questions/769357/recovering-from-a-r...

(The user who asked that question now uses a nick, but had the real-sounding name mentioned in the article when I first read that serverfault question earlier this week.)

edit. ...I really hope there isn't a real Marco Marsala someone pretended to be. Search engine results for that name are not great ATM.


I accidentally set executable permissions on the following Makefile:

https://github.com/samalba/acdcontrol/blob/master/Makefile

While typing quickly I tab-completed to 'Makefile' and hit enter. Although it was a Makefile, it was executed as a bash script. bash ignored the incorrect syntax and executed line 10:

rm -rf $(DIRNAME)/*

If make had parsed the file, $(DIRNAME) would have been non-empty. But it was empty under bash.

--preserve-root did not protect against this, because the target of the command was '/*'
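
For anyone wondering why it came out empty: to make, $(DIRNAME) is a variable reference, but to bash it's command substitution, so bash tries to run a command called DIRNAME, finds nothing, and substitutes an empty string:

    $ echo "rm -rf $(DIRNAME)/*"
    bash: DIRNAME: command not found
    rm -rf /*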


How/why does this work without "#!" at the beginning of the file? I just tested with fish shell, and I get "Exec format error. The file './x' is marked as an executable but could not be run by the operating system." But, from bash or dash it does execute commands from the file.


It'll fall back to sh, the default shell (or whatever it's symlinked to): http://paste.click/QxWUMG

In my case that's bash, debian based systems use dash.


That shouldn't happen: if you look at fs/exec.c (search_binary_handler), there isn't a "fallback to shell" option. And fs/binfmt_script.c doesn't fall back to shell either. Are you sure you don't have some weird binfmt_misc hook enabled?



Well, that's just dumb. Why on earth should "source this random file as a shell script" be the default?


Because that's how Unix originally worked, to put it simply. This is a whole discussion subject in its own right, of course.


And here it is.


Years ago, I worked in an investment bank and we had a programmer put a batch program into production that executed the following as a shell command:

  rm -rf foo /
It was supposed to be:

  rm -rf foo/
It didn't run as root, but still managed to wipe out all the business data files. What saved us was that the servers were configured with RAID 1 and before the start of the nightly batch cycle, the mirror was "split" and only one copy mounted.

So we just had to restore the missing files from the other half of the mirror to revert to the start of the batch window and rerun the entire night's jobs.


So in other words, the acceptance protocol for this script was inadequate.

In the minicomputer era, it was common for a programmer to be required to run it on this one poor donkey of a machine to make sure it caught nothing on fire before moving to the big machine.


It was the 1990s, not that it's an excuse but TDD wasn't a "thing" yet, nor was version control (at least not in that shop). For every change to a program we printed the diff, attached a cover memo, and filed it in a cabinet.

Yes, there were test systems and programmers were supposed to test all their changes, but they also were the ones who deployed their own changes to production, so there were ways for this to happen pretty easily.



What happened to the programmer?


If the programmer followed proper review and change control procedures, nothing should happen. Everyone writes bad code once in a while; in this particular case, the bug happened to be more catastrophic, but that's more bad luck than anything else.

We have code reviews and change controls not only to reduce the number of defects, but also to provide cover when mistakes inevitably slip through.


Nothing. She continued to work there until after I moved on.


After a scare like that I'll bet she never makes that mistake again and makes sure to triple check for typos in dangerous commands.


Since I didn't find any links in the article, here's the original post:

http://serverfault.com/questions/769357/recovering-from-a-rm...


It's in the 2nd paragraph, linked from "called Server Fault" text.



I don't believe that's the truth for a second. Of course The Independent didn't look at the company in question to see if there was any litigation between this guy and his customers.


The Independent also didn't look up -r as it stands for recursively remove directories...


They said:

"the r deletes everything within a given directory"

which is what it does. Non technical readers aren't going to understand "recursively remove directories".


Fair point.


The real news here is that the Independent will write a feature story on a successful forum troll. Where were they back in the days of the Fucked Company message board when we could have used their help?


Reminds me of the time I accidentally typed in 'crontab -d' instead of 'crontab -e'.

Those two letters are eerily too close to each other.


This is one of the reasons why we have infrastructure as code now, so system changes can be reviewed and tested just like application code, and more types of accidents can be reverted via source control :)


In the article the guy is using ansible. He even had off-site backups, but they were mounted before his ansible playbooks ran, wiping them out as well.


Another dangerous command is `crontab` with no arguments. It reads a new crontab from standard input. If you type Ctrl-C, it will abort and leave your existing crontab in place. If you type Ctrl-D, you've just created a new empty crontab and clobbered your old one.

My personal crontab is in a separate file in a source control system. I don't use `crontab -e`; I edit that file and feed it to the `crontab` command.
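
Roughly this, in other words (the file name is just an example):

    crontab -l > crontab.txt   # one-time export of whatever is currently installed
    $EDITOR crontab.txt        # edit the tracked copy
    crontab crontab.txt        # install it; no chance of fat-fingering -e into -r or -d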

(It would be nice if HN handled backticks the way they're done in Markdown.)


A customer did this on his server once while I sat with him to add something. Since restoring the crontab from backups would have been a little inconvenient for such a small thing, I grepped the log files for what commands were run by cron and at what interval, and had it rewritten in a few minutes.


I've managed that one as well. One month into a new job and working on servers with complex applications installed that I wasn't yet familiar with. Thankfully I had `crontab -l` just beforehand, otherwise I'd have been screwed.


Most admins have a text file somewhere with the contents of the crontab. Especially because these things happen more often than they should.


Indeed. I do generally keep backups of crontabs, not just in case of this kind of scenario but also in case the platform blows up in any unexpected ways. Sadly the company in question didn't. However I have since made it a personal policy to always -l before editing so I have a "backup" in my tmux scroll back (that time I mentioned before, it was pure chance that I had -l)


did you mean crontab -r ?

I remember using crontab -r assuming that -r is to open it in read-only mode, like vim -R.

Bad assumption!


> did you mean crontab -r ?

On some platforms it's -r, on others it's -d. I suspect it's down to which cron daemon you run but never really cared enough to investigate. In any case, both are next to the 'e' key so either is just as dangerous in terms of typos.


A competent specialist will be able to help that guy. rm -rf / is easily fixable if you don't mess around with the disks afterwards. Backups usually have a recognizable format, so it's possible to restore the backups first and then everything else from them.


...and this is why we include a 'backup technology' question in our technical interviews--where 'offsite backup' must follow with something like "possibly the most important type of backup because..."


You know the sad thing is that even this isn't idiot-proof and needs to be qualified. One of my customer's brilliant "cost-saving" measures was to have an offsite backup solution that was basically an rsync script that ran every 15 minutes.

So when someone on their end did something catastrophic to their data and it took them an hour to notice, they were incredulous that we couldn't help them restore their data even though it was "backed up offsite!" because their "backup" solution had already caught up and duplicated the broken data.


And that's why if you're using rsync, you ought to be using rsnapshot instead, and have generations of backups so that you are not overwriting your most recent one.
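
For the curious, the core of what rsnapshot automates is rsync's hard-link trick, which you can also do by hand (paths made up):

    today=$(date +%F)
    rsync -a --delete --link-dest=/backup/latest /data/ "/backup/$today/"
    ln -sfn "$today" /backup/latest   # unchanged files are hard links, so older generations stay intact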


I find rdiff-backup is great as a drop-in replacement.


Yeah, a copy is not a backup. It's a mistake too many people make. You need to have historical backups.


He even deleted the backups.

>>the code had even deleted all of the backups that he had taken in case of catastrophe. Because the drives that were backing up the computers were mounted to it, the computer managed to wipe all of those, too.


In other words, he didn't have backups. A live copy of a running system available to that running system is not a backup.


Well it still prevents against the hard drive failing. It doesn't protect against bad code, which he obviously didn't consider.


RAID parity drives also protect against a drive failing but they're not backups either.


If it's a duplicate copy of data intended in case of failure then yes, it is a backup. It's not an offsite backup, but many people don't keep their personal backups offsite.


It is so deficient as a backup that I don't think it qualifies to be called a backup. That was my point.

Backups are expected to protect against data loss for a number of different failure cases (eg. disk failure, hardware fault leading to slow filesystem corruption, fire/theft, failed upgrade, "undo" for accidental change or deletion). There is a point where something addresses so few of these failure cases that you can't reasonably call it a backup.


There's another term for what it is: redundancy.

Redundancy is there for fast recovery times (even zero downtime depending on how redundancy is implemented). It's not intended to run as a backup as redundancy devices are live and can fail from many of the same causes that will take your primary devices offline (fire, sysadmin fail, etc)

Likewise, if your "backups" are always online then it works better for business continuity than it does as a backup. So realistically it's more of a redundancy share.


If you're not doing offsite and cold backups, then you're just asking for trouble. If not crap like this then a fire or a ransomware infection or a malicious employee, etc.


He actually was doing a remote backup (although probably not a cold backup). Unfortunately, he had used mount instead of rsync over ssh, making it vulnerable to the rm -rf command.


That's not a backup, it's a mirror.


Are you suggesting that you can't backup with rsync? Because you can do full and incremental backups with rsync.

In fact Time Machine on OS X looks like it does backups in this manner...


Just using rsync to make copies isn't a backup. If you use rsnapshot (which stores each copy separately) then you have a backup. Copies are not sufficient if you find out that something broke three weeks ago.


a mirror is kind of a backup if you don't update it live... but of course you should have other backups that are offline.


If you are not doing offline and offsite backups you are not doing backups at all.


While as others already pointed out this story seems a little fishy, it serves well to reflect if something like this could in theory happen to your infrastructure.

Do you have your backup servers in the same configuration management software (ansible, puppet, ssh-for-loop etc) as the rest of the servers? One grave error (however unlikely) in your base configuration really can take down everything together in one fell swoop.

How "cold" are your backups? If the backup media are not physically disconnected and secured, you can most likely construct a scenario where the above, malware, a hacker or a rouge admin could destroy both the backups and the live data.

I will certainly suggest some additional safeguards for our backups.


Yep, that's what I hope everyone will be doing... thinking about their own backups and infrastructure.

We have backups off-site on disconnected media, so that alone prevents the kind of accident we're talking about.

We use btrfs send / receive to send OS images from the primary container host to the backup container host. The snapshots are read-only, so I'm fairly sure I can't just 'rm -rf' them, I'd have to actually 'btrfs subvolume delete foobar' them.

I should try that though on one of the test servers...
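
Roughly, the snapshot-and-send flow looks like this (the subvolume and host names are just examples):

    # create a read-only snapshot; rm -rf can't touch its contents,
    # it would have to be removed with "btrfs subvolume delete"
    btrfs subvolume snapshot -r /srv/containers /srv/containers@$(date +%F)
    # stream it to the backup host
    btrfs send /srv/containers@$(date +%F) | ssh backuphost btrfs receive /srv/backups/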


The bash -e and -u options might have saved him here:

http://redsymbol.net/articles/unofficial-bash-strict-mode/


This. All my scripts begin with "set -euo pipefail", and my editor linter complains loudly if that line isn't there.

I wish distros would gradually migrate to making those settings the default. Even if it took a while, I think it would be priceless.
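
For reference, the header I mean looks like this (the rm line is a contrived example of what -u catches):

    #!/usr/bin/env bash
    set -euo pipefail   # exit on errors, error on unset variables, fail broken pipelines
    IFS=$'\n\t'         # optional: safer word splitting

    # with -u, this aborts with an "unbound variable" error instead of
    # expanding to "rm -rf /old" when BACKUP_DIR was never set
    rm -rf "${BACKUP_DIR}/old"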


Any script that includes rm -rf followed by variables in a path is an accident waiting to happen. Mounting the backup volumes is just icing on the cake for this extremely incompetent web hosting provider.

It made me nervous to type rm -rf in this comment form. Those letters are dark magic.
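
A contrived sketch of the failure mode (variable names are made up):

    # if both variables are unset or empty, this collapses to: rm -rf /
    rm -rf "${BASE_DIR}/${SITE_NAME}"

    # the ${VAR:?msg} form makes the shell abort before rm ever runs
    rm -rf "${BASE_DIR:?not set}/${SITE_NAME:?not set}"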


> Mounting the backup volumes

That sounds more like an accident waiting to happen than a single line of bad code.


Why is the data not recoverable?

Maybe things have changed, but rm doesn't zero out the drive. And with the backup that was rm too it should all be recoverable. Or am I missing something?


Not directly, no, but some filesystems make it hard to recover the directory structure, which in some cases is a big problem. You could probably recover files, but if the backups aren't stored in a tar/zip/... archive, it will be hard to recover both the data and the structure.


Most of the data is probably still there on the drive. But the filesystem data that says where it all is actually stored is probably irreplaceably gone. If some of that can be recovered then it should be possible to recreate individual files. Without it, someone would have to guesstimate where all the files are and then maybe manually piece them together (a single file may be in fragments in different parts of the disk). They'd also have to differentiate old deleted versions of files from the most recent deleted versions.

So yeah, it could technically be recovered, but it's going to be a very big chore.


Too bad he wasn't running Illuminos (OpenSolaris) based servers (or even some Solaris versions) that would have just flat out refused to run rm -rf /


It's required in POSIX 1003.1-2013 that rm refuse to remove the root directory[0].

[0] http://pubs.opengroup.org/onlinepubs/9699919799/utilities/rm...


Recent-ish versions of GNU rm do the same. However, it only protects against `rm -rf /`, not `rm -rf /*`, as the latter is expanded by the shell.
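
For illustration (the exact error wording may differ between coreutils versions):

    $ rm -rf /
    rm: it is dangerous to operate recursively on '/'
    rm: use --no-preserve-root to override this failsafe
    $ rm -rf /*
    # the shell has already expanded /* to /bin /boot /etc ... before rm runs,
    # so --preserve-root never sees "/" and cannot refuse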


It's illumos, not Illuminos.


Toy Story 2 was almost entirely deleted because of this same problem:

http://thenextweb.com/media/2012/05/21/how-pixars-toy-story-...


He says he's recovered almost all the data. FWIW. https://serverfault.com/questions/769357/recovering-from-a-r...


As part of my PhD research, I developed a shell scripting language (shill-lang.org, previously on hn: https://news.ycombinator.com/item?id=9328277) with features that provide safety belts against this sort of error. From speaking to administrators and developers, we believe these types of errors cause far more worry and consume far more time than they should.

Now that I'm graduating, we've started the process of refining Shill into a product that we can offer to administrators and developers to make their lives simpler. If this sounds like a tool you wish you had (or if you wish a similar tool existed for your platform of choice), we'd love to hear from you.


According to a comment on the ServerFault website, he actually managed to recover the data [1]. He consulted a data recovery company, and they gave him a list of the files they were able to save [2].

1 : http://serverfault.com/questions/769357/recovering-from-a-rm...

2 : http://serverfault.com/questions/769357/recovering-from-a-rm...


Reminds me of https://archive.is/9R2j8

(Original thread has been deleted.)


That's awesome.

A company I left a while back recently had two servers accidentally rebooted through some sort of automated task (probably puppet). The fine, I'm told, was one billion dollars.

Someway, somehow, he still works there. :)


Details, company name?


It's a very large French bank. I don't know anything except that it was a paired batch processing server and didn't push further questions, honestly. For some reason, I found absolutely nothing in the news about it, but my source of information is credible. It doesn't surprise me for a second that it happened, since, for example, I spent months trying to get these guys to fix their literally useless MQ DR failover scripts, but nothing ever came from it, since, they didn't have anywhere to test.

With the way they treat their employees - fucking good riddance. There was a giant mess when Disney forced their NOC to train their replacements, but yet these guys did the exact same thing, plus some, and there was no public awareness during or after it. The best part was their push to move everyone to Montreal. Lower pay, not a guaranteed extension and you're forced to move? Okay.

The AMRS CTO actually left about a month after he got the position and took me along with one other person over to a new company. Goldman's head of tech actually just left to go to the same place. Not gonna lie, it sounds incredibly suspicious, especially considering the kinds of shenanigans that went on there... thankfully I'm no longer working there.

It's a very, very strange place in finance.


That just makes me really, really want to know what the original company was and what product they were selling.


Some kind of botnet access, probably.


I once had an incident with a server which triggered notification alerts about a failing httpd service. While I was looking into the issue, the mail service suddenly stopped working, then the database service went down - it was like a slow cascading failure, affecting all services on the server one after the other. I finally noticed the 'rm' command in the process list and asked the client if he had run any custom commands as root on the machine. Turns out he followed the instructions on a website to install some custom software without checking any of the commands and just copied & pasted them into the prompt. He basically managed to run "rm -rf" on / and deleted his own server.

Luckily recent backups were available, so the damage was rather small, but it was interesting to see someone just pasting & executing commands without knowing what they actually do, especially when logged in as root.


Did he change his ServerFault username? I see commenters referencing @MarcoMarsala, but the OP's name appears to be bleemboy at the moment[0]

[0] http://serverfault.com/questions/769357/recovering-from-a-rm...


Looks like he did, the nick was MarcoMarsala earlier this week.


+1 for snapshotting filesystems.


Or any backup solution worth its salt, including "RAID1 where you yank out and replace one of the drives every other week".


RAID is not a backup.

... Especially when your RAID is busy rebuilding for N hours every other week.


Well RAID 1 would be a backup if you yanked out and replaced a drive every week. In that case it would be a weekly snapshot.


Unless you're mirroring across more than 2 drives, you have an AID setup.


You are incorrect. RAID 1 is a mirror setup. There are two drives with exactly the same information. One of the two drives is redundant. RAID 1 does not include striping and only requires 2 drives for redundancy.


I think what DDub is getting at is that there is no redundancy for the data received while the disk is mirrored to the new twin. For that, you'd need a mirrored pair plus a drive to yank out as the backup.


Ah, that's a good point. When you first put the fresh drive in it would be AID for a while... and no one likes AIDs.


Instead of doing system administration with root, couldn't we have a system user with the same privileges as root except it wouldn't have write access to the files of some users (like your clients)?

So you could still rm -rf / all you want, delete everything but still have /home or /var/www content untouched.

We run certain programs with limited privileges to mitigate risks (bugs, exploits, etc.), so why shouldn't we also limit the privileges of root to mitigate the risk of buggy system administration?

Obviously having actual backups and testing your code before applying it to production is good practice but I feel like doing system administration with root while having potential bugs in your sysadmin code (as in any other software) leaves the door open to the next catastrophic failure.


Nothing to see here, it was a hoax/troll: https://meta.serverfault.com/questions/8696/what-to-do-with-...


With 'set -u' he could have stayed safe. Bash probably should never treat undefined variables as valid but empty values; it's so dangerous.



>Together, the code deleted everything on the computer, including Mr Masarla’s customers' websites, he wrote. Mr Masarla runs a web hosting company, which looks after the servers and internet connections on which the files for websites are stored.

And he has no backups? Including rolling backups in unconnected storage?

>Mr Marsala confirmed that the code had even deleted all of the backups that he had taken in case of catastrophe. Because the drives that were backing up the computers were mounted to it, the computer managed to wipe all of those, too.

Then the company probably deserved to die. Sorry for the customers, though...


"Most users agreed that it was unlikely that Mr Marsala would be able to recover any of the data."

Perhaps they are unfamiliar with extundelete? http://extundelete.sourceforge.net
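
Assuming an ext3/ext4 filesystem, something like this is the usual starting point (the device name is hypothetical, and the filesystem should be unmounted or mounted read-only first):

    umount /dev/sdb1
    extundelete /dev/sdb1 --restore-all
    # recovered files land in ./RECOVERED_FILES by default, if memory serves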


Sorry if that's a stupid question, but does that mean he was running his script as root?


He was running with full administrative permissions and thus file access permissions on files were ignored.

"root" is the name of the default administrative account on Unix and Unix like systems.


Thanks :)


Once I did sudo rm -R . when I was in /var.

When I discovered what I had done and stopped it, /var/www was already gone.

Luckily we had backups, but that sure did teach me a lesson about rm.

These days I look very carefully before using rm -R, and I also type the entire path.


Sometimes seeing certain usernames can get you to accidentally write the wrong thing.


Reminds me of the guy who chose the Xbox Live gamer tag 'XBOX TURN OFF'


I managed to sudo chown -R {useless_user}:{useless_user} {foo}/ with foo undefined, whilst simultaneously distributing that command with dsh to our entire cluster of 10 machines. This was after testing that everything worked on the development machine. So of course, I retraced my steps to find out what went wrong, and killed the development machine too.

The upside is that we knew we had issues, and with everything broken the onus is on the right people to ensure they're fixed before we get distracted by the next shiny feature.

Sometimes, setting your servers on fire is the solution to technical debt.


This made me think of this quote: https://twitter.com/devops_borat/status/41587168870797312

More seriously, this isn't the first time I've heard of rm -rf backfiring - one of my friends said that at one place he worked, an IT guy walked out one day & never came back after trying to fix a co-worker's computer. He found out afterwards, by investigating the co-worker's computer, that the IT guy must have run rm -rf while root & wiped out everything.


There are many ways you can safeguard [0] `rm`, but in the end, it's better to just use a tool that moves files to the trash [1] instead.

[0]: https://github.com/sindresorhus/guides/blob/master/how-not-t...

[1]: https://github.com/sindresorhus/trash-cli


I lost the private key to one of my AWS servers after it had had a traffic spike due to blog coverage[1]. It was a toy system so it was using local storage, but then it became sort of popular. Luckily I had a process monitor set up so it managed months of uptime before something happened that I couldn't do anything to fix.

[1] http://waxy.org/2008/04/exclusive_google_app_engine_ported_t...


I would like to point out that requiring `set -u` at the top of all your production bash scripts will prevent this kind of disaster - the script will fail if unassigned variables are referenced.


If anyone knows bash, it's bashinator! :)


Is there ever a situation where someone would want to rm -rf / ?


Well, it is certainly fun.

I never bothered to count exact numbers, but from my experience, close to two thirds of all people, when presented with a root shell and no consequences, will run rm -rf in some way.

Humble ones issue "rm -rf /usr" or "rm -rf /lib", others go straight to "/bin/rm -rf /". I've seen one person do "rm -rf /* ", immediately followed by "find / -delete". I'd really like to take a peek at his/her thought process at that moment; it looked like the desire for destruction was really strong in that one particular brain ;-)

So yeah, while it's not a particularly useful one, there's indeed a situation where one definitely wants to run it.

disclaimer: I run SELinux playbox with free root access and session recording, and peeking into what others do is also fun.

Edit: rm minus rf slash asterisk formatting.


Nope (almost never), which is why GNU rm requires the '--no-preserve-root' flag if you actually want to do that for some reason.


In the manpage for rm, I see "--no-preserve-root do not treat ‘/’ specially (the default)". For real? The default is to do the worst thing?


No, the default is to do nothing. The (default) refers to "treat root specially" not the flag.


What are the almost never situations?

I'd be super interested, because I cannot think of one at all :D


Cops / assailant about to bust down your door and it's all you know / have time for?


Yeah so I typed rm -f * the other day after typing rm -f *~ repeatedly in a few different directories. In the 2 seconds it took me to realise, I lost a lot of data. First time I've made that particular typing slip-up in many years. Thankfully I had backups to restore from. Real heart-sink moment.

Sure, there should have been aliases for rm -i and I shouldn't have used -f etc etc etc. But sometimes this stuff is going to happen.


This is what comes from treating undefined variables as empty, rather than as errors. Bad language design in the shell.


One take-away from this is that it's probably better to save your backups somewhere where you can't delete them. Make sure that nothing using rm touches your database backups. Also, try to keep them backed up in multiple places. For example, store backups on a server you own, and on a cloud server, like on S3.
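
For example (the bucket name is made up), even a simple nightly sync to S3 helps, ideally into a versioned bucket so a compromised server can't silently destroy the history:

    aws s3 sync /srv/backups/ s3://example-company-backups/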


Another way this could have been avoided is if he had used the "--one-file-system" flag, which wouldn't have deleted the backups, as they were mounted on a separate filesystem.
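
A sketch of the idea, assuming the backup drive is its own filesystem mounted somewhere under the target (GNU rm; the path names are made up):

    # with --one-file-system, the recursion stops at mount-point boundaries,
    # so a separately mounted backup volume under this path would be skipped
    # rather than emptied
    rm -rf --one-file-system /srv/hosting/old-customer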


Nearly did the same thing once by messing up the ordering of flags. Thankfully this was before devops tools were in place, so a ctrl-c stopped the wiping before it got too deep, but Friday afternoon downtime is still bad.

Tape/Blu-ray disc backups can come in really handy in these cases, since they're not easy to wipe.


I was expecting something much more subtle.

I guess the best course of action to prevent this would be to alias rm to a custom script, then parse the arguments to make sure the root directory is never recursively deleted, then call rm from within your script.
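
A rough sketch of such a wrapper (not hardened; a glob like /* already expanded by the shell would still slip through):

    #!/usr/bin/env bash
    # refuse if any argument resolves to the filesystem root, else pass through
    for arg in "$@"; do
        case "$arg" in -*) continue ;; esac   # skip option flags
        if [ "$(readlink -f -- "$arg" 2>/dev/null)" = "/" ]; then
            echo "refusing to remove /" >&2
            exit 1
        fi
    done
    exec /bin/rm "$@"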



OR "Man learns the value of backups because somethings things go wrong."


From the article: "All servers got deleted and the offsite backups too because the remote storage was mounted just before by the same script (that is a backup maintenance script)."


That's not a backup. If your "backup script" requires mounting the backup volume on a production machine, then it's really just a live copy of your data.


How is this even remotely possible..?

- No developers have a local copy of code on their machines?

- No backups at all?

Worst case scenario, couldn't you attempt to retrieve the data from the hard drive? Though the database(s) would likely not be retrievable.


This is what I thought. I would have to go out of my way to completely nuke the servers I work on. I'm trying to understand what structure this guy's company had if everything can be mistakenly deleted without any chance of recovery.

Maybe I misread the article and he runs a niche hosting company that has different requirements, but it seems strange to me to be able to completely remove your online body of work in a matter of minutes.


He's a hosting company. He deleted his clients code/content.


According to the thread they were able to recover almost all of the data so far. So the whole "deletes his entire company" no longer seems accurate. Still pretty crazy.


Not a Unix admin .. but can you swap the rm command with a different executable that prompts you when the -rf option is specified?


You can just alias rm to a script of yours that does just that with like, one extra line of bash. I've done this for a couple of commands where I prefer default behavior that isn't specifiable by flags.


It's weird to me that commenting on a Stack Exchange question could get you quoted on several major news websites.


Shouldn't the proper fix be to imply -i when -rf is applied to /? At least in some fork of coreutils.


Correct me if I'm wrong, but rm doesn't wipe data out, it just deallocates the disk space devoted to it. If you actually managed to wipe out your entire file system with rm you could likely still recover your data with a recovery tool.

This story smells a wee bit fishy to me.


It's just hard to get the filesystem entries for the file back. rm doesn't specifically wipe, you're right, but the filesystem entries are deleted, which means you basically have to grep the disk for bits of the file you want with known contents.


Oops



