What's most epic about this is that it's in the UNIX Hater's Handbook. One of its rants was about how better-designed systems would warn you if you were about to nuke your whole system, the reasoning being that a command to wipe the whole system was more likely a mistake than a developer's or admin's intent. UNIX would do it without blinking. Inherently unsafe programming and scripting combined with tools like that meant lots of UNIX boxes went kaput.
And today, over two decades later, a person just accidentally destroyed his entire company with one line without warnings on a UNIX. History repeats when its lessons aren't learned. This problem, like setuid, should've been eliminated by design fairly quickly after it was discovered.
EDIT: Added a link to ESR's review of the UNIX Hater's Handbook, which links to UHH itself: http://esr.ibiblio.org/?p=538 It nicely covers what was a rant, what was fixed, and what remains true. Linking in case people want to work on the latter, plus it explains my sour relationship with UNIX. :)
One of the more fascinating aspects of human history is how much effort we're willing to devote to creating varied senses of safety, even if that safety is only an illusion.
Now, we can call this a failure of design, but really, people who rely on technology they don't understand can't be saved by good design. Sure, this particular case could be fixed by disallowing the recursive flag on the filesystem root, but safety can never be the primary design concern of any technological system.
Imagine if a sword were made with safety as a first-class concern. You can't design a sword that can be used safely by the untrained. No weapon can be; training with a weapon is a prerequisite for safely using it. Similarly, every technology has to be understood by those using it. If you don't understand it, you're just inviting trouble.
For a business using technology, the needs are actually fairly straightforward. You need an understanding of what needs to be backed up, and a process for performing the backups. If you've picked the former right (backing up human-readable information rather than data only readable by software programs that might go away in a crash), then risk is minimized.
This is a strange post. You imply that, just because it's impossible to absolutely prevent every kind of disaster, no effort toward safety should be made at all.
By the same logic, you could strip away all the airbags, seatbelts, comfy seats and assistance systems of modern cars: After all, accidents still happen and safety mechanisms might even lure drivers into more reckless behavior. (This is in fact happening with seatbelts)
I think it's less useful to think about absolute safety than about which failures are likely to occur and how effective our measures against those specific failures are.
Requiring the --force flag is obviously not an effective safeguard against root deletions, otherwise we wouldn't have so many stories about it. My theory is that there are three reasons for it:
- As other people wrote, if you frequently batch-delete files, you get trained very quickly to always use -f as plain rm is very annoying to use for large sets of files. Unlike other flags, -f won't make you stop and think.
This could be fixed by making rm-without-f actually usable - for example by only asking once and not for every file, like, oh I don't know... Windows.
- rm can interact with shell parsing in very opaque and fatal ways. My guess is that most root deletions happen the way they did in this post: not a literal rm -rf / but some unfortunate variable interpolation where the author didn't realize that it can evaluate to "/" (see the sketch at the end of this comment). That's a very unobvious point of failure that takes a lot longer to learn than just using rm. Therefore rm should absolutely warn about it.
- there is actually an expectation that rm could be safe, as most deletes you do on a modern system are reversible - either because you have a "recycle bin" or a backup. So a warning would make sense to counter that expectation.
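To illustrate the interpolation trap harmlessly (a sketch - the variable names are made up, and echo stands in for rm so nothing gets deleted):

    # imagine both variables silently failed to get set earlier in the script
    BACKUP_ROOT=""
    CLIENT=""
    echo rm -rf "$BACKUP_ROOT/$CLIENT"
    # prints: rm -rf /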
'vinceguidry actually makes a pretty good point. It's one thing to cover potential stupid mistakes with safety features. But beyond some point, safety starts to oppose utility - i.e. a perfectly safe car would be a simple chair. Perfectly safe software is likewise software that is totally useless for anything.
It's important to consider when designing software that safety should be about gracefully handling mistakes, and not something that lures the user into a false sense of not having to know what they're doing. Unfortunately, the latter attitude is what drives today's UX patterns and software design in general, which is a big part of why tech-illiterate people remain tech-illiterate, and modern programs and devices are mostly shiny toys, not actual tools.
It's true that safety and also security can impair the usefulness of something past a certain point. It's also irrelevant to our current topic given the existence of systems that don't self-nuke easily. This is a UNIX-specific problem that they've fought to keep for over 20 years with admittedly some improvement. There were alternatives, both UNIX setups and non-UNIX OS's, that protected critical files or kept backups from being deleted [at all] without very specific action from an administrator. And nobody complained that they couldn't get work done on or maintain a VMS box.
So, this isn't some theoretical, abstract, extreme thing some are making it out to be. It's a situation where there are a number of ways to handle a few routine tasks with inherent risk. Some OS's chose safer methods, with unsafe methods available where absolutely necessary. UNIX decided on unsafe all around. Many UNIX boxes were lost as a result whereas the alternatives rarely were. It wasn't a necessity: merely an avoidable design decision.
I'm glad we have the same opinion then - as I said, it's not very useful to reason about "perfect safety".
It's certainly possible to make a product "safer" than necessary and hinder utility (though I think "safety" is the wrong concept to look at here - see below), but if the common opinion of your product among tech-illiterate people is "complicated and scary", I think you can be pretty sure that you are still a long way away from that point.
In fact, some versions of rm do add additional protection against root deletions, e.g. refusing to act on "/" unless you pass the --no-preserve-root flag. What utility did that protection destroy?
I believe if you really want to make people more tech-literate (which today's apps are doing a horrible job of, I agree), you have to give them an honest and consistent view of their system, yes.
But you also have to design the system such that they can learn and experiment as safely as possible and can quickly deduce what a certain action would do before they do it.
Cryptic commands, which are only understandable after extensive study of documentation, and which oh by the way become deadly in very specific circumstances don't help at all here.
"Cryptic commands, which are only understandable after extensive study of documentation, and which oh by the way become deadly in very specific circumstances don't help at all here."
Exactly. That's another problem that was repeatedly mentioned in the UNIX Hater's Handbook. It still exists. Fortunately, there are distros improving on aspects of organization, configuration, command shells, and so on. I'm particularly impressed with NixOS doing simple things that should've been done a long time ago.
> You imply that, just because it's impossible to absolutely prevent every kind of disaster, no effort toward safety should be made at all.
Not at all. We should absolutely work to make things safer. But we need to be realistic and temper our sense of idealism. Nothing was going to save this guy from disaster; if it wasn't 'rm', it just would have been something else.
My point is that you can't expect safety features to obviate the need to know what you're doing.
I don't know exactly how you'd plan on fighting reality. The fact that this guy was going to get hosed eventually isn't some justification; it's a fact. People who do stupid things get burned.
If someone wants safety features, let them pay for them. If someone wants to add one, sure, so long as I can remove it if it gets in my way. Who knows, maybe they'll actually be worth having. But I'm not going to lose sleep over every idiot who ruins his life over something he didn't or couldn't learn about. There's absolutely nothing you can do to save stupid people from making stupid decisions.
Maybe I remove a safety feature I don't need and hurt myself with it. Now I'm the moron. Hopefully I learn from it. Nothing you could have done about that either.
Show me something foolproof, and I'll show you a greater fool.
In the UNIX Hater's Handbook, defenders of rm consider accidental deletion a "rite of passage" and remark that "any decent systems administrator should be doing regular backups" (see page 62). The author's response is funny:
“A rite of passage”? In no other industry could a manufacturer take such a cavalier attitude toward a faulty product. “But your honor, the exploding gas tank was just a rite of passage.” “Ladies and gentlemen of the jury, we will prove that the damage caused by the failure of the safety catch on our chainsaw was just a rite of passage for its users.” “May it please the court, we will show that getting bilked of their life savings by Mr. Keating was just a rite of passage for those retirees.” Right.
I'm surprised how relevant parts of this book are 22 years later.
Have an immutable filesystem, where "deletes" are recoverable by going back in time. At least until you do a scheduled "actual delete" that will reclaim disk space.
Practically speaking, if you're quick, an 'rm' isn't totally destructive even without backups. There's a good chance your data is still there on the disk; it's just not associated with anything, so it could be overwritten at any point. Best to mount the disk read-only and crawl through the raw bits to find your lost data (I recovered a week's worth of code this way several years ago).
My favorite answer to the common interview question: "What was your biggest mistake, and how did you recover from it?" Answer: Back in 1993, I once deleted a critical data file. Fortunately, the AIX host was sitting next to me, so I quickly reached over and flipped the power switch off. The strategy being: writes were buffered and flushed out periodically, so hitting the power switch prevented that last write from hitting the disk. And if this didn't work (and caused more file system corruption), well I would have needed to restore from backup anyway.
Right, but if you delete your entire file system there won't be anything left to come along and do the "actual delete", so you're safe until someone comes along with a rescue disk or otherwise mounts it on a system that knows how to deal with this.
At the very least, when you rm important-file.txt instead of importanr-file.txt, you have a chance.
A pre-delete would immediately hide files from apps and services so they can "fail fast", and the actual delete would only happen after an "I've been running fine for two days" period. Of course this implies that actively open files should not be pre-deleted on UNIX at all (at least not by the rm process). Even if you deleted the entire filesystem, backups included, there would be a chance to boot into recovery mode and undelete everything. We could even go further and apply small-file versioning at the fs level to prevent misconfiguration accidents in /etc.
That's simple and powerful; I can't tell why it still isn't implemented today.
Well, there are some safeguard checks you can do to prevent the easy abuses, like requiring an additional flag for rm -rf-ing everything in the root or /home/*/, or across device boundaries as someone said.
You can redesign rm so people don't find themselves typing -rf as force of habit.
You can have multiple delete commands: mark as overwritable if needed and hide from 'ls -a'; actually delete; overwrite sectors with zeros - like we have in GUIs.
You can have a permission system that isn't just "this account can do literally anything" or "this account can't do anything"
Here's a dumb idea off the top of my head: /etc/rmblacklist.conf, autopopulated (by the distro) with a list of files (/boot and the actual boot image, for instance) that require a GNU-style long option --nuke to delete. It's still easily scriptable, but you'd rarely ever actually need it, and requiring a long option would serve as a double check that you meant it when you did.
Sure it's still possible to --nuke something due to a bug or negligence but I bet it'd cut down on fatal errors. Plus the user could have their own ~/.rmblacklist.conf to guard against particularly persistent dyslexias.
You could even erase the contents entirely if you really think being able to delete root on a whim keeps your kung fu strong.
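A rough sketch of what such a wrapper could look like (everything here is hypothetical - the blacklist file, the wrapper itself - and the --nuke handling is omitted for brevity):

    #!/bin/sh
    # refuse to delete anything listed (one absolute path per line) in /etc/rmblacklist.conf
    for target in "$@"; do
        case "$target" in -*) continue ;; esac    # skip options
        if grep -qxF "$target" /etc/rmblacklist.conf 2>/dev/null; then
            echo "rm: '$target' is protected; use --nuke if you really mean it" >&2
            exit 1
        fi
    done
    exec /bin/rm "$@"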
Darn, I didn't notice it was that old. Makes the situation even worse for UNIX rm defenders. Like when Trusted Xenix eliminated setuid vulnerabilities mostly by clearing the setuid bit during a write, with the admin having to manually reset it. Simple shit. Mainstream response? "Just audit all your apps for setuid and be extra careful in..." (sighs)
Use something like that with scripting. It really is that easy to protect critical stuff with an otherwise easy-to-use rm. UNIX has always been resistant to doing stuff like this. However, its architecture makes it fairly easy for developers to do it themselves. That's to its credit.
That attitude was pervasive in both UNIX and C. That other systems avoided common problems of both with little work or penalty shows it's unnecessary today. Yet, the results of that attitude continue to do damage.
The author stretches his analogies substantially. It's nice to see this was happening 22 years ago just like it is nice to see UNIX from more than 22 years ago still being used today.
Yeah, I saw the same problem in the NSA backdoor debate where people liken it to brake failures and such. I pointed out repeatedly that computer breaches rarely maim or kill anyone. Usually don't even cost them their jobs or bankrupt their business. Such analogies are strawmen to try to boost their argument with an emotional response.
>You can't design a sword that can be used safely by the untrained. No weapon can be, training with a weapon is a prerequisite for safely using it.
I was a tank crewman in the US military, and there is a high chance of death or dismemberment from regular tank operation. Drivers have very limited visibility, tank turrets can pin people as they traverse, the breech of the main gun violently recoils into the crew compartment with limited guards, the rounds can be set off by static electricity (or just plain lit on fire), I saw one guy smash all his teeth out when riding inside and the tank hit a hole, and so on.
We had a saying: "tanks are designed to kill, they don't care who."
Then why does rm have the -f flag in the first place? Clearly someone thought oh-hay-guys maybe a safeguard wouldn't be out of place. They just designed a really awful one.
For the last 1000 years, swords have typically been constructed with a guard. It's a simple and useful safety feature, and it almost never prevents an intended use. Same for disallowing '/' as the target of rm.
I don't want to sound fussy, but sword guards aren't designed to protect the user from the sword, they're for protecting the user from other swords. That they make using a sword safer is a side-effect. I agree with the point you're making though.
After every disaster, you can come up with a process that would have stopped that specific disaster. That doesn't mean it's a good idea to implement that process everywhere.
It's too bad you're being downvoted. It was not too long ago that people blamed the Germanwings Flight 9525 crash on how the cockpit door was designed to protect against hijackers, post 9/11:
> Imagine if a sword were made with safety as a first-class concern.
This seems like a silly example. A weapon, meant to dismember and maim its owner's attackers, is one of those things that's impossible to make completely safe. Granted, I could think of plenty of ideas that might make it safer without compromising its effectiveness as a weapon, but it's simply not an apt example to use here.
A computer is a general purpose device. It can be used to help image cancer, launch a nuclear weapon or play games. Considering that it's meant to be used by everyone, without discrimination, it seems to make sense that you need to do the best you can to protect the user from themselves.
I worked in Apple Care support for about a year. The majority of your users are not going to know all of the consequences of their actions, even the ones doing system administration (because let's face it, almost every company in the world needs at least a little of that now and not all of them are going to hire someone who knows what they're doing).
You can't protect a user from everything. But when you can protect a user from doing something that would have screwed up their whole system, lost a project, etc? That's helpful. Correcting input is what computers are essentially there for.
Interesting points. I think systems like Burroughs counter that concept, in that a lot of safety can be baked into a system. Here's what they did in 1961:
Notice that's good UI design for the time, hardware elimination of the worst problems, interface checks on functions, limits on what apps can do to the system, and plenty of recovery. Systems like NonStop, KeyKOS, OpenVMS, XTS-400, JX, and so on added to these ideas. You can certainly bake a strong foundation of safety into a system while allowing plenty of flexibility.
So, for example, critical files should be write-protected except for use by specific software involved in updates or administrative action. Many of the above systems did that. Then, one can use a VMS-style versioned filesystem that leaves the originals in place in case a rollback is needed, so long as there's free space for that. Such a system handling backups and restores with modern-sized HDs wouldn't have nuked everything. It might even have left everything intact, if using a lean setup, but I can't say for this specific case.
"You can't design a sword that can be used safely by the untrained."
A sword is designed to do damage. A better example would be a saw: it's designed to be constructive but carries the risk of cutting your hand off. Even that can be designed to minimize risk to the user.
"If you've picked the former right, (backing up human-readable information rather than data only readable by software programs that might go away in a crash) then risk is minimized."
That's orthogonal. A machine-readable format just needs a program to read it. The risk is whether the data is actually there in whole or part. This leads to mechanisms like append-only storage or periodic, read-only backups that ensure it's there. Or these clustered, replicated filesystems on machines with RAID arrays that lots of HPC or cloud applications use. Also, multiple, geographical locations for the data.
People doing the above with proven protocols/tools rarely lose their data. Then there's this guy.
Table saws should never be used on flesh. rm(1) should always be used on files. How in FSM's noodly universe is the command supposed to intuit which files it should safely delete versus those it shouldn't?
> ...or administrative action.
You mean like, "sudo rm -rf {$undefined_value}/{$other_undefined_value}"? D'oh!
Two different people here have already figured out this wouldn't have happened on OpenVMS due to its versioned filesystem with rollback. People also claim it had saner commands for this stuff, but I can't recall whether its delete command was smarter.
He really should not have made the first element of the path a variable. Doing an "rm -rf /folder/{$undefined_value}/{$other_undefined_value}" would have made his day much better.
Also, never having all backup disk volumes mounted at the same time is good practice.
There's also the phenomenon that people have an inherent tolerance of risk, so the "safer" you make something, the more reckless people tend to be.
When traction control and antilock brakes became mainstream, one result was that some people started driving faster on snowy roads, up until their risk tolerance was the same as before.
If you understand that a typo can destroy your business, you'll be careful to not log in as "root" on a routine basis and double check everything you do and keep good backups. On the other hand if you expect the system to prevent you from doing anything really damaging, you might be more careless about your approach.
> You can't design a sword that can be used safely by the untrained. No weapon can be
This is a really weird argument, since weapons are used in combat, which is not 'safe' by definition.
But if you do want a weapon that the untrained can use without much chance of hurting themselves, look to a spear. It was the go-to weapon for untrained militias from the time history began up to gunpowder taking over - and even then, bayonets are stuck on rifles to turn them into spears.
A counter that applies to almost every comparison the opposition brings up. Further, unlike rm, swords don't have an easy solution to their problems that fits in a comment or two in this same thread. ;)
$ sudo rm -rf /
rm: it is dangerous to operate recursively on ‘/’
rm: use --no-preserve-root to override this failsafe
Relatively modern distro, but this has been in coreutils for a while (it was fixed in Ubuntu's coreutils in 2008, for instance)
$ egrep '^(NAME|VERSION)=' /etc/os-release
NAME="Red Hat Enterprise Linux Workstation"
VERSION="7.2 (Maipo)"
Yes, I just did this on the workstation I'm typing on. I'm somewhat curious whether he did that on an absolutely ancient distro (>8 years old), or didn't actually run rm -rf but rather the Ansible Python equivalent.
Someone brought that to my attention on Schneier's blog, suggesting the post was trolling. I'm holding off on going that far for now, as some details might be missing and this sort of thing has happened many times. I don't use Ansible. Does it already pass --no-preserve-root or whatever other modifiers are necessary on modern distros?
I feel like if you use `rm -rf` (particularly `--force`) with privileges, it shouldn't be the job of Unix to stop you.
Also tangentially, if you don't have a sensible backup in place that would protect you from (or at least mitigate) a complete wipe of a single machine (or even all primary ones), you are doing something wrong.
The problem with that is people are trained to use -f because it's so annoying to try and use rm without it.
Really, the -f flag should just mean "don't ask for confirmation" and a separate flag should be required to mean something like "yes I do want to nuke my computer". And maybe there should be a flag that means "cross device boundaries", and by default it could refuse to delete anything that has a device number different than the argument it started with. That would at least prevent you from nuking your network-attached storage.
This is probably a far better solution than trying to make rm smarter about which kinds of files it's somehow either "safe" or "unsafe" to unlink. Even then, though, you'd still get people who invoke it with "--across-devices --yes-i-really-mean-it" set, with unexpected and disastrous consequences.
And then someone will come along and bitch that rm isn't safe enough yet again.
Well, the point is that --across-devices won't be a flag you use as a matter of habit, it's something you'll only use when rm tells you it won't cross the device boundary and you realize that, yes, you really do want to delete across a device boundary. So you'll only add it in the specific cases where it's warranted, and you'll have already tried rm without it (to verify that you're deleting the right thing).
Come to think of it, I don't think I've ever used rm to delete across device boundaries. It just doesn't seem like an action you usually want to take.
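Incidentally, GNU rm already ships something close to this: --one-file-system skips any directory that sits on a different filesystem from the argument you named. Something like (path invented):

    rm -rf --one-file-system /srv/build    # won't descend into a backup volume mounted underneath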
... so then you change it to be more safe yet again.
I don't understand this attitude. Of course software isn't perfect; it's not even close, it's pretty awful. But the best thing about it: it's malleable. When things don't work, you change them to work better.
When things don't work, you change them to work better.
I'd submit that adding layer upon layer of complexity to prevent all the myriad stupid things people might do using a particular piece of software isn't axiomatically "better".
Maybe if lives depend on its correct function, it's worth it, but that kind of strict requirements gathering and execution is well-understood by the people who live in that world.
Making sure that J. Random DevOps Dude doesn't foot-gun himself when he's paid to know better isn't that.
But J. Random DevOps doesn't just foot-gun himself; he foot-guns the whole Op. You can fire him because it turns out he didn't know better, but that's not going to fix the problem he created, the problem now affecting the whole company.
Believe me, I know this. Entirely too well, in fact.
At my last job, the senior DevOps dude foot-gunned the entire company, by running a read-write disk benchmark (using fio(1)) against the block device (instead of a partition, which, while still stupid, would at least not have been actively destructive) on both my master and all of my slave PostgreSQL hosts. At the same time. And, of course, without telling anyone what he was doing, so the first inkling I had that there was a problem was about 20 minutes later, when I started getting a steadily increasing number of errors suggesting disk corruption.
How does one make such a tool drool-proof enough to prevent that kind of idiocy? Please, help me figure that out. And then give me a time machine, because that was a 16-hour day I'd really rather not have experienced.
And, no, the right move is generally not to fire the jackass who makes that kind of mistake. In my case, above, the company spent about three quarters of a million dollars (just in revenue, never mind how much time was burned in meetings about the incident, my efforts to fix the problem, as well as his and the rest of his team's efforts, and so on) teaching him never to do that again. You don't buy lessons that expensively and then let someone else benefit from them.
(That said, he did get fired several months later for telling the entire engineering lead team to fuck off, in so many words, for their having made a perfectly reasonable request, which was entirely within his responsibilities, and his skills, to satisfy.)
You should not use it the first time around; verify that you actually need it first. That second's pause may save your data :) If you're feeling rash, get up and walk around until it passes.
When I was in seventh grade, I took an Industrial Arts class ("woodshop"). The first few weeks of the course were spent going over safety of the machinery. In particular, I remember a heavy-handed message that Mr. Hopfer gave at the drill press:
This is a piece of industrial machinery. It is not a toy. If you put your hand on the stage and lower the bit, the machine will not jam up and make funny noises because it is too difficult. Instead, it will drill a hole through your hand. That is what makes it useful. If it didn't do that, you wouldn't be able to cut through wood.
There are actually ways to make it so it will in fact stop on contact with one's hand (this exists for saws, but you could imagine an analogous system for a drill press):
That will also apparently stop when in contact with wood that's recently been cut down or is damp, destroying the (expensive) brake and blade in the process. It even has a bypass mode for that reason.
Luckily for us, computers are slightly more sophisticated than drills and we can incorporate sane safety checks with relative ease. In this case people are just asking for an explicit flag in the rare situation when they do want to delete everything.
I really hate this type of attitude. Just because .0001% of users want to do something, doesn't mean the other 99.9999% need to suffer for it.
Are you against aircraft collision avoidance too? If your pilot wants to fly into another plane, then the guidance system shouldn't try to stop him, right?
`rm -r` does give warnings. You have to intentionally turn them off with the -f flag. It sounds like the -f has become too standard, a default flag, which undermines its entire purpose. You should only be using -f if you're absolutely 100% sure what you're doing, and that clearly wasn't the case here.
A question: I checked the manpage for -f (I don't really ever use it on purpose; rm seems to work fine for most of my file-deleting needs).
It says "ignore nonexistent files and arguments, never prompt".
Seems to me that the "never prompt" behaviour is an important requirement for a backup-script, though? Cause backup-scripts should work unattended, and definitely not pause and wait for input under any circumstances, right?
True, but backup scripts that make use of rm -rf should have been really thoroughly tested, and contain a check that they are really only about to delete the thing they're supposed to delete.
Also, perhaps they shouldn't be running as root. There's no reason why any script should have permission to write everywhere. It just needs write access on the backup device.
He destroyed his company with a series of poor decisions. Running "rm -rf /" was just the last one. And it was the last one because it successfully destroyed the company.
Yep, it was an improper backup and recovery system that did him in. I'm sure many other businesses have been destroyed by the same, whether or not it was an rm -rf that was the trigger.
Not sure I agree. My grandpa had a saying: "It's impossible to make things fool-proof because fools are so ingenious." No matter how many safety checks you add to a system, someone will find a way to fuck it up.
Poka-yoke (ポカヨケ) is a Japanese term that means "mistake-proofing". A poka-yoke is any mechanism in a lean manufacturing process that helps an equipment operator avoid (yokeru) mistakes (poka). Its purpose is to eliminate product defects by preventing, correcting, or drawing attention to human errors as they occur. The concept was formalised, and the term adopted, by Shigeo Shingo as part of the Toyota Production System. It was originally described as baka-yoke, but as this means "fool-proofing" (or "idiot-proofing") the name was changed to the milder poka-yoke.
This question that went unanswered in the replies bears repeating:
Any idea why the command actually ran? If $foo and $bar were both undefined, rm -rf / should have errored out with the --no-preserve-root message.
The only way I can think of that this would have actually worked on a CentOS 7 machine is if $bar evaluated to '*', so what was actually run was 'rm -rf /*' (which --preserve-root doesn't catch, since the glob expands to the individual top-level directories).
As the above notes, I'm pretty sure recent versions of Redhat/CentOS actually protect against this sort of thing.
On the offchance you're not running a recent server, however, this could also be avoided by using `set -u` in the bash script, as it would cause undefined variables to error out.
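A minimal illustration (variable names made up):

    #!/bin/bash
    set -u
    rm -rf "/backups/$foo/$bar"
    # with $foo unset, bash aborts with "foo: unbound variable" before rm ever runs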
Another lesson to be learned is that it's exceptionally bad practice to use Ansible to push out shell scripts that can be handled by native Ansible modules: http://docs.ansible.com/ansible/file_module.html
I feel like there is nobody replying who tested this theory because their computers don't work anymore.
It is a tragic story, but rm -rf has been almost a joke in the industry for a very long while now. Even really old systems should have received an update of some form, to such an extent that the story in the OP would be ridiculous rather than a discussion topic.
When I use the command I need to block out all distractions. I check my surroundings for things which might fall on my keyboard. I borderline make sure my phone is turned off before I carefully begin typing that.
I feel uncomfortable typing it into hacker news anywhere but the middle of a sentence. I can't imagine the bullets I would be sweating while deploying a bash script to all servers that included it. There is a problem that needs to be addressed.
Before you apply `set -u` to all of your scripts, be aware that an empty array counts as undefined. So if you have arrays for which being empty is valid, be sure to `set +u` right before accessing them (and then `set -u` again after).
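For example (bash; as far as I remember this particular case was relaxed in bash 4.4 and later):

    set -u
    files=()
    echo "${files[@]}"    # older bash: "files[@]: unbound variable"
    set +u
    echo "${files[@]}"    # fine: expands to nothing
    set -u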
This is why you should never run rm -r with the -f switch the first time. -f suppresses all the prompts, including the ones that would make you confirm you really mean what you're about to erase. I see this constantly in SO and other answers, and it's a really, really bad practice to do it without thinking.
This was inside a shellscript though. It could well be that the script e.g. needed to remove write-protected files, so the flag really had to be there.
I wish git was better about having a deletable directory. Most of the time that I need to use the -f switch, it's because I get this trying to remove a .git directory:
I'm a bit surprised the newspaper did not validate the source? They're basically quoting a Super User / Stack Exchange thread which was probably a troll.
If a hosting company had deleted 1535 client accounts, we would have heard other stories about it from angry clients?
It's kinda strange that major "mainstream" news publications are publishing articles with web board posts as the primary source, indeed, the entire story itself, with basically zero extra work. They could at least try to contact the guy, interview him, make sure he at least seems legit.
That's not too surprising, unfortunately. With the internet and the obsession with getting news out as quickly as possible (because hey, being the first to report something gives you a lot more backlinks and clicks), the standard of proof for a story has gone from 'a lot of evidence gathered through actual investigation' to 'someone said this on an internet forum somewhere'. Basically, the internet and social media rewards quick reporting, not accurate reporting or verification.
It's still better than a lot of gaming news sites though, where 'some guy mentioned something on Twitter/Reddit/4chan' is suddenly front page news within ten minutes.
There's a specific reason why some called it fake: as advised by others, Marsala ran a dd command to save the raw content of the disks, in case the recovery process screws up somehow.
He inverted the `if` and `of` arguments. You'd expect him to pay attention after what had just happened. This doesn't pass the smell test for some.
Then again, you'd also expect him to be quite stressed out. That does make that mistake a bit more likely.
Possibly, but the dd command does have more affordances than say, ld. `if` and `of` do stand for input file and output file, and it's harder to swap named arguments than positional arguments.
He could have swapped arguments like `/dev/sdc` and `/dev/sde` though…
The quotes in the article definitely do not match the diction you typically hear in an SE thread. SE users typically chime in with helpful, logic-driven advice—the advice they gave this unfortunate person sounded more like 4chan or reddit—"You're screwed!!"
This felt a bit unlikely, but what really convinced me it was a troll was a follow-up comment where he said he accidentally switched "if" and "of" in a dd command.
Does this article do anything other than paraphrase the ServerFault thread? On first reading I thought they had contacted the poor fellow to confirm his identity or something, but rereading, it would appear that they didn't: no further information than the original source.
(The user who asked that question now uses a nick, but had the real-sounding name mentioned in the article when I first read that ServerFault question earlier this week.)
edit: ...I really hope there isn't a real Marco Marsala someone pretended to be. Search engine results for that name are not great ATM.
While typing quickly I tab-completed to 'Makefile' and hit enter. Although it was a Makefile, it was executed as a bash script. bash ignored the incorrect syntax and executed line 10:
rm -rf $(DIRNAME)/*
If make had parsed the file, $(DIRNAME) would have been a make variable and nonempty. Under bash, $(DIRNAME) is command substitution, and since there is no DIRNAME command it expanded to nothing.
--no-preserve-root did not protect against this, because the target of the command was '/*'
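You can see what bash makes of that line harmlessly with echo (a sketch; the exact listing obviously depends on the system):

    $ echo rm -rf $(DIRNAME)/*
    bash: DIRNAME: command not found
    rm -rf /bin /boot /dev /etc /home ...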
How/why does this work without "#!" at the beginning of the file? I just tested with fish shell, and I get "Exec format error. The file './x' is marked as an executable but could not be run by the operating system." But, from bash or dash it does execute commands from the file.
That shouldn't happen: if you look at fs/exec.c (search_binary_handler), there isn't a "fallback to shell" option. And fs/binfmt_script.c doesn't fall back to shell either. Are you sure you don't have some weird binfmt_misc hook enabled?
Years ago, I worked in an investment bank and we had a programmer put a batch program into production that executed the following as a shell command:
rm -rf foo /
It was supposed to be:
rm -rf foo/
It didn't run as root, but still managed to wipe out all the business data files. What saved us was that the servers were configured with RAID 1 and before the start of the nightly batch cycle, the mirror was "split" and only one copy mounted.
So we just had to restore the missing files from the other half of the mirror to revert to the start of the batch window and rerun the entire night's jobs.
So in other words, the acceptance protocol for this script was inadequate.
In the minicomputer era, it was common for a programmer to be required to run it on this one poor donkey of a machine to make sure it caught nothing on fire before moving to the big machine.
It was the 1990s, not that it's an excuse but TDD wasn't a "thing" yet, nor was version control (at least not in that shop). For every change to a program we printed the diff, attached a cover memo, and filed it in a cabinet.
Yes, there were test systems and programmers were supposed to test all their changes, but they also were the ones who deployed their own changes to production, so there were ways for this to happen pretty easily.
If the programmer followed proper review and change control procedures, nothing should happen to them. Everyone writes bad code once in a while; in this particular case, the bug happened to be more catastrophic, but that's more bad luck than anything else.
We have code reviews and change controls not only to reduce the number of defects, but also to provide cover when mistakes inevitably slip through.
I don't believe that's the truth, for a second. Of course the Independent didn't look at the company in question to see if there was any litigation between this guy and his customers.
The real news here is that the Independent will write a feature story on a successful forum troll. Where were they back in the days of the Fucked Company message board when we could have used their help?
This is one of the reasons why we have infrastructure as code now, so system changes can be reviewed and tested just like application code, and more types of accidents can be reverted via source control :)
In the article the guy is using ansible. He even had off-site backups, but they were mounted before his ansible playbooks ran, wiping them out as well.
Another dangerous command is `crontab` with no arguments. It reads a new crontab from standard input. If you type Ctrl-C, it will abort and leave your existing crontab in place. If you type Ctrl-D, you've just created a new empty crontab and clobbered your old one.
My personal crontab is in a separate file in a source control system. I don't use `crontab -e`; I edit that file and feed it to the `crontab` command.
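Roughly this workflow (the file name is whatever you like):

    crontab -l > mycrontab    # snapshot whatever is currently installed
    $EDITOR mycrontab         # edit the copy kept under version control
    crontab mycrontab         # install it; never run bare 'crontab' with no argument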
(It would be nice if HN handled backticks the way they're done in Markdown.)
A customer did this on his server once while I sat with him to add something. Since restoring the crontab from backups would have been a little inconvenient for such a small thing, I grepped the log files for which commands were run by cron and at what interval, and had it rewritten in a few minutes.
I've managed that one as well. One month into a new job and working on servers with complex applications installed that I wasn't yet familiar with. Thankfully I had `crontab -l` just beforehand, otherwise I'd have been screwed.
Indeed. I do generally keep backups of crontabs, not just in case of this kind of scenario but also in case the platform blows up in any unexpected ways. Sadly the company in question didn't. However I have since made it a personal policy to always -l before editing so I have a "backup" in my tmux scroll back (that time I mentioned before, it was pure chance that I had -l)
On some platforms it's -r, on others it's -d. I suspect it comes down to which cron daemon you run, but I never really cared enough to investigate. In any case, both are next to the 'e' key, so either is just as dangerous in terms of typos.
A competent specialist will be able to help that guy. rm -rf / is easily fixable if you don't mess around with the disks afterwards. Backups usually have a recognizable format, so it's possible to recover the backups first and then restore everything from them.
...and this is why we include a 'backup technology' question in our technical interviews--where 'offsite backup' must be followed by something like "possibly the most important type of backup because..."
You know the sad thing is that even this isn't idiot-proof and needs to be qualified. One of my customer's brilliant "cost-saving" measures was to have an offsite backup solution that was basically an rsync script that ran every 15 minutes.
So when someone on their end did something catastrophic to their data and it took them an hour to notice, they were incredulous that we couldn't help them restore their data even though it was "backed up offsite!" because their "backup" solution had already caught up and duplicated the broken data.
And that's why if you're using rsync, you ought to be using rsnapshot instead, and have generations of backups so that you are not overwriting your most recent one.
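A minimal rsnapshot.conf sketch (paths invented; note that the real config file requires tabs, not spaces, between fields, and older versions spell 'retain' as 'interval'):

    snapshot_root   /backups/snapshots/
    retain  daily   7
    retain  weekly  4
    backup  /var/www/       localhost/

Cron then runs 'rsnapshot daily' and 'rsnapshot weekly', and you keep generations instead of one continuously overwritten mirror.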
>>the code had even deleted all of the backups that he had taken in case of catastrophe. Because the drives that were backing up the computers were mounted to it, the computer managed to wipe all of those, too.
If it's a duplicate copy of data intended for use in case of failure, then yes, it is a backup. It's not an offsite backup, but many people don't keep their personal backups offsite.
It is so deficient as a backup that I don't think it qualifies to be called a backup. That was my point.
Backups are expected to protect against data loss for a number of different failure cases (eg. disk failure, hardware fault leading to slow filesystem corruption, fire/theft, failed upgrade, "undo" for accidental change or deletion). There is a point where something addresses so few of these failure cases that you can't reasonably call it a backup.
Redundancy is there for fast recovery times (even zero downtime depending on how redundancy is implemented). It's not intended to run as a backup as redundancy devices are live and can fail from many of the same causes that will take your primary devices offline (fire, sysadmin fail, etc)
Likewise, if your "backups" are always online then it works better for business continuity than it does as a backup. So realistically it's more of a redundancy share.
If you're not doing offsite and cold backups, then you're just asking for trouble. If not crap like this, then a fire or a ransomware infection or a malicious employee, etc.
He actually was doing a remote backup (although probably not a cold backup). Unfortunately, he had used mount instead of rsync over ssh, making it vulnerable to the rm -rf command.
Just using rsync to make copies isn't a backup. If you use rsnapshot (which stores each copy separately) then you have a backup. Copies are not sufficient if you find out that something broke three weeks ago.
While, as others already pointed out, this story seems a little fishy, it serves well as a prompt to reflect on whether something like this could in theory happen to your infrastructure.
Do you have your backup servers in the same configuration management software (ansible, puppet, ssh-for-loop etc) as the rest of the servers? One grave error (however unlikely) in your base configuration really can take down everything together in one fell swoop.
How "cold" are your backups? If the backup media are not physically disconnected and secured, you can most likely construct a scenario where the above, malware, a hacker or a rouge admin could destroy both the backups and the live data.
I will certainly suggest some additional safeguards for our backups.
Yep, that's what I hope everyone will be doing... thinking about their own backups and infrastructure.
We have backups off-site on disconnected media, so that alone prevents the kind of accident we're talking about.
We use btrfs send / receive to send OS images from the primary container host to the backup container host. The snapshots are read-only, so I'm fairly sure I can't just 'rm -rf' them, I'd have to actually 'btrfs subvolume delete foobar' them.
I should try that though on one of the test servers...
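Roughly like this (paths and host name invented):

    # take a read-only snapshot of the container subvolume
    btrfs subvolume snapshot -r /srv/containers /srv/snapshots/containers-2016-04-15
    # ship it off; 'btrfs send' only accepts read-only snapshots
    btrfs send /srv/snapshots/containers-2016-04-15 | ssh backuphost btrfs receive /backups/containers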
Any script that includes rm -rf followed by variables in a path is an accident waiting to happen. Mounting the backup volumes is just icing on the cake for this extremely incompetent web hosting provider.
It made me nervous to type rm -rf in this comment form. Those letters are dark magic.
Maybe things have changed, but rm doesn't zero out the drive. And with the backup that was rm'd too, it should all be recoverable. Or am I missing something?
Not directly, no, but some filesystems give you a hard time recovering the file structure, which in some cases is a big problem.
You could probably recover files, but if the backups aren't stored in a tar/zip/... file, it will be hard to recover both the data and the structure.
Most of the data is probably still there on the drive. But the filesystem data that says where it all actually is stored is probably irreplaceably gone. If some of that can be recovered then it should be possible to recreate individual files. Without it someone would have to guestimate where all the files are and then maybe manually piece them together (a single file may be in fragments in different parts of the disk). They'd also have to differentiate old deleted versions of files from the most recent deleted versions.
So yeah, it could technically be recovered, but it's going to be a very big chore.
As part of my PhD research, I developed a shell scripting language (shill-lang.org, previously on HN: https://news.ycombinator.com/item?id=9328277) with features that provide safety belts against this sort of error. From speaking to administrators and developers, we believe these types of errors cause much more worry and cost much more time than they should.
Now that I'm graduating, we've started the process of refining Shill into a product that we can offer to administrators and developers to make their lives simpler. If this sounds like a tool you wish you had (or if you wish a similar tool existed for your platform of choice), we'd love to hear from you.
According to a comment in the ServerFault website, he actually managed to recover the data [1]. He consulted a company for data recovery and they gave him a list with the files that they could manage to save [2].
A company I left a while back recently had two servers accidentally rebooted through some sort of automated task (probably Puppet). The fine, I'm told, was one billion dollars.
It's a very large French bank. I don't know anything except that it was a paired batch processing server and didn't push further questions, honestly. For some reason, I found absolutely nothing in the news about it, but my source of information is credible. It doesn't surprise me for a second that it happened, since, for example, I spent months trying to get these guys to fix their literally useless MQ DR failover scripts, but nothing ever came from it, since, they didn't have anywhere to test.
With the way they treat their employees - fucking good riddance. There was a giant mess when Disney forced their NOC to train their replacements, but yet these guys did the exact same thing, plus some, and there was no public awareness during or after it. The best part was their push to move everyone to Montreal. Lower pay, not a guaranteed extension and you're forced to move? Okay.
The AMRS CTO actually left about a month after he got the position and took me along with one other person over to a new company. Goldman's head of tech actually just left to go to the same place. Not gonna lie, it sounds incredibly suspicious, especially considering the kinds of shenanigans that went on there... thankfully I'm no longer working there.
I once had an incident with a server which triggered notification alerts about a failing httpd service. While I was looking into the issue, the mail service suddenly stopped working, then the database service went down - it was like a slow cascading failure, affecting all services on the server one after the other. I finally noticed the 'rm' command in the process list and asked the client if he ran any custom commands as root on the machine. Turns out he followed the instructions on a website to install some custom software without checking any of the commands and just copied & pasted them into the prompt.
He basically managed to "rm -rf" on / and deleted his own server.
Luckily recent backups were available, so the damage was rather small, but it was interesting to see someone just pasting & executing commands without knowing what they actually do, especially when logged in as root.
You are incorrect. RAID 1 is a mirror setup. There are two drives with exactly the same information. One of the two drives is redundant. RAID 1 does not include striping and only requires 2 drives for redundancy.
I think what DDub is getting at is that there is no redundancy for the data received while the disk is mirrored to the new twin. For that, you'd need a mirrored pair plus a drive to yank out as the backup.
Instead of doing system administration as root, couldn't we have a system user with the same privileges as root except that it wouldn't have write access to the files of some users (like your clients)?
So you could still rm -rf / all you want, delete everything but still have /home or /var/www content untouched.
We run certain programs with limited privileges to mitigate risks (bugs, exploits, etc.), so why shouldn't we also limit the privileges of root to mitigate the risk of buggy system administration?
Obviously having actual backups and testing your code before applying it to production is good practice but I feel like doing system administration with root while having potential bugs in your sysadmin code (as in any other software) leaves the door open to the next catastrophic failure.
>Together, the code deleted everything on the computer, including Mr Masarla’s customers' websites, he wrote. Mr Masarla runs a web hosting company, which looks after the servers and internet connections on which the files for websites are stored.
And he has no backups? Including rolling backups in unconnected storage?
>Mr Marsala confirmed that the code had even deleted all of the backups that he had taken in case of catastrophe. Because the drives that were backing up the computers were mounted to it, the computer managed to wipe all of those, too.
Then the company probably deserved to die. Sorry for the customers though...
I managed to sudo chown -R {useless_user}:{useless_user} {foo}/ with foo undefined, whilst simultaneously distributing that command with dsh to our entire cluster of 10 machines. This was after testing that everything worked on the development machine. So of course, I retraced my steps to find out what went wrong, and killed the development machine too.
The upside is that we knew we had issues, and with everything broken the onus is on the right people to ensure they're fixed before we get distracted by the next shiny feature.
Sometimes, setting your servers on fire is the solution to technical debt.
More seriously, this isn't the first time I've heard of rm -rf backfiring - one of my friends said that at one place he worked, an IT guy walked out one day and never came back after trying to fix a co-worker's computer. My friend found out afterwards, by investigating the co-worker's computer, that the IT guy must have run rm -rf while root and wiped out everything.
I lost the private key to one of my AWS servers after it had had a traffic spike due to blog coverage[1]. It was a toy system so it was using local storage, but then it became sort of popular. Luckily I had a process monitor set up so it managed months of uptime before something happened that I couldn't do anything to fix.
I would like to point out that requiring `set -u` at the top of all your production bash scripts will prevent this kind of disaster - the script will fail if unassigned variables are referenced.
I never bothered to count exact numbers, but from my experience, close to two thirds of all people, when presented with a root shell and no consequences, will run rm -rf in some way.
Humble ones issue "rm -rf /usr" or "rm -rf /lib", others go straight to "/bin/rm -rf /". I've seen one person do "rm -rf /* ", immediately followed by "find / -delete". I'd really like to take a peek at his/her thought process at that moment; it looked like the desire for destruction was really strong in that one particular brain ;-)
So yeah, while it's not a particularly useful one, there's indeed a situation where one definitely wants to run it.
disclaimer: I run an SELinux playbox with free root access and session recording, and peeking into what others do is also fun.
Yeah so I typed rm -f * the other day after typing rm -f *~ repeatedly in a few different directories. In the 2 seconds it took me to realise, I lost a lot of data. First time I've made that particular typing slip-up in many years. Thankfully I had backups to restore from. Real heart-sink moment.
Sure, there should have been aliases for rm -i and I shouldn't have used -f etc etc etc. But sometimes this stuff is going to happen.
One take-away from this is that it's probably better to save your backups somewhere where you can't delete them. Make sure that nothing using rm touches your database backups. Also, try to keep them backed up in multiple places. For example, store backups on a server you own, and on a cloud server, like on S3.
Another way this could have been avoided is if he used "--one-file-system" flag, which wouldn't delete backups as they were mounted on a separate filesystem.
Nearly did the same thing once by messing up the ordering of flags. Thankfully this was before devops tools were present, so a ctrl-c stopped the wipe before it got too deep, but a Friday afternoon downtime is still bad.
Tape/Blu-ray disc backups can come in really handy in these cases, since they're not easy to wipe.
I guess the best course of action to prevent this would be to alias rm to a custom script, then parse the arguments to make sure the root directory is never recursively deleted, then call the real rm from within your script.
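Something like this, as a sketch (the wrapper name and the list of protected paths are invented, and it only catches the most obvious cases):

    #!/bin/sh
    # "saferm": refuse the most dangerous targets, then hand everything off to the real rm
    for arg in "$@"; do
        case "$arg" in
            /|/bin|/boot|/etc|/usr|/var|/home)
                echo "saferm: refusing to remove '$arg'" >&2
                exit 1 ;;
        esac
    done
    exec /bin/rm "$@"

One caveat: an alias only takes effect in interactive shells, so scripts that call rm will typically bypass it.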
From the article: "All servers got deleted and the offsite backups too because the remote storage was mounted just before by the same script (that is a backup maintenance script)."
This is what I thought. I would have to go out of my way to completely nuke the servers I work on. I'm trying to understand what structure this guy's company had if everything can be mistakenly deleted without any chance of recovery.
Maybe I misread the article and he runs a niche hosting company that has different requirements, but it seems strange to me to be able to completely remove your online body of work in a matter of minutes.
According to the thread they were able to recover almost all of the data so far. So the whole "deletes his entire company" no longer seems accurate. Still pretty crazy.
You can just alias rm to a script of yours that does just that with like, one extra line of bash. I've done this for a couple of commands where I prefer default behavior that isn't specifiable by flags.
Correct me if I'm wrong, but rm doesn't wipe data out, it just deallocates the disk space devoted to it. If you actually managed to wipe out your entire file system with rm you could likely still recover your data with a recovery tool.
It's just hard to get the filesystem entries for the file back. rm doesn't specifically wipe, you're right, but the filesystem entries are deleted, which means you basically have to grep the disk for bits of the file you want with known contents.