well, the rm example is overly simple on purpose - the only thing that -f is actually going to do that's remotely dangerous is removing files that have the readonly bit set. I've never actually been bitten by that. In general though, I think this pattern scales poorly - the more complicated your task is, the more dangerous the "force it" mode becomes.
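To make that concrete, here's a rough Python sketch (the function names and behaviour are made up for illustration, not taken from any real tool): for a single file, "force" only overrides one small safety check, but for a bigger task the same flag quietly comes to mean "destroy whatever is in the way".

```python
import os
import shutil

# Hypothetical sketch (not any real tool's API): the same "force" flag
# means very different things as the task grows, which is why the
# pattern scales poorly.

def remove_file(path: str, force: bool = False) -> None:
    # For a single file, "force" only overrides the read-only check that
    # an interactive rm would prompt about.
    if force:
        os.chmod(path, 0o600)
    os.remove(path)

def deploy_release(artifact: str, target_dir: str, force: bool = False) -> None:
    # For a bigger task, the same flag quietly comes to mean: blow away
    # whatever is already deployed and skip the refusal to overwrite.
    if os.path.exists(target_dir):
        if not force:
            raise RuntimeError(f"{target_dir} exists; refusing to overwrite")
        shutil.rmtree(target_dir)  # destructive, and nobody gets prompted
    shutil.unpack_archive(artifact, target_dir)
```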
---
On the subject of what to do when something goes wrong:
Sometimes retrying a package install does fix the problem: if there was a network error, for example, and you downloaded an incomplete set of files, the next time you run it, it will be fine.
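A minimal sketch of that "just run it again" idea, assuming an apt-based system (the retry count and delay are arbitrary placeholders):

```python
import subprocess
import time

def install_with_retry(package: str, attempts: int = 3, wait_seconds: int = 30) -> None:
    """Retry a package install a few times, on the theory that a transient
    network error (an incomplete download, say) will clear up on the next run."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(["apt-get", "install", "-y", package])
        if result.returncode == 0:
            return
        print(f"attempt {attempt} of {attempts} failed; retrying in {wait_seconds}s")
        time.sleep(wait_seconds)
    raise RuntimeError(f"giving up on {package} after {attempts} attempts")
```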
If your package manager goes off the rails and gets your system into an inconsistent state, then you have a decision to make. Is this going to happen again? If not, just fix the stupid thing manually: there's no point in automating a one-time task. If it is likely to recur, then you need to write some code to fix it (and file a bug report to your distro!). I do not believe that there is a safe, sane way to pre-engineer your automation to fix problems you haven't seen yet!
In the meantime maybe your automation framework stupidly tries to run the install script every 20 minutes and reports recurring failure. The cost of that is low.
Docker is awesome, for sure, and I'll definitely use it on my next server-side project. It isn't a magic bullet, though - you still have to configure things, and they still have dependencies. It's just that, hopefully, failures are more constrained.
---
and on the point of upgrading for security fixes: the sad reality is that even critical fixes for security holes must be tested in a staging environment. No upgrade is ever really, truly guaranteed to be safe. I guess if the bug is bad enough you just shut down production entirely until you can figure out whether you have a fix that is compatible with everything.
well, the rm example is overly simple on purpose - the only thing that -f is actually going to do that's remotely dangerous is removing files that have the readonly bit set.
Since you originally outlined the requirements as:
Take "rm" as a trivial example - when I say `rm foo.txt`, I want the file to be gone.
then the file should be gone even if "the readonly bit" was set.
This is not only a contrived example but a bad one for system management. rm is an interactive command-line tool, with a user interface that is meant to keep you from shooting yourself in the foot: it politely checks that the file is writable before attempting to remove it and gives a warning if it isn't. I would expect a system management tool to call unlink(2) directly to remove the file rather than run rm; unlink(2) has no user interface at all.
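To illustrate the difference, here's a small Python demonstration, assuming a POSIX system: os.unlink (i.e. unlink(2)) removes a read-only file without any prompt, because deletion is governed by the directory's permissions, not the file's mode bits.

```python
import os
import stat
import tempfile

# POSIX assumed: unlink(2) ignores the file's own mode bits entirely.
# The warning you get from rm is rm's politeness layered on top of the
# same underlying call.

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "foo.txt")
    with open(path, "w") as f:
        f.write("hello\n")
    os.chmod(path, stat.S_IRUSR)   # owner read-only: the "readonly bit"
    os.unlink(path)                # succeeds silently, no prompt
    print(os.path.exists(path))    # False
```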
However, a system management tool doesn't start with no knowledge of the current state of the system; it starts from a state that is known (or otherwise discoverable/manageable) and then attempts to transform the system into a target state. It cannot be expected to transform any arbitrary state into the target state. As such, the result of unlink(2) should be reported, and the operator should have the option of fixing up the corner cases where the tool is unable to perform as desired. If you've got 100 machines and 99 of them can be transformed into the target state by the system management tool and one of them cannot, that isn't a deficiency of the tool; most likely that one system has diverged in some way. Only the operator can decide whether the divergence is something that can or should be handled on a continuous basis, either by changing what the tool does (forcing removal of a file that otherwise can't be removed, for example) or by fixing that system after investigation.
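Here's a hypothetical sketch of that report-don't-force behaviour (the function name and result format are invented for illustration): the tool tries to converge on "file absent", records the outcome, and leaves the stragglers to the operator instead of silently escalating to a forced removal.

```python
import os

# Hypothetical sketch: attempt to reach the target state ("file absent")
# and report the result per machine, rather than forcing.

def ensure_absent(path: str) -> dict:
    if not os.path.lexists(path):
        return {"path": path, "changed": False, "ok": True}   # already in target state
    try:
        os.unlink(path)
        return {"path": path, "changed": True, "ok": True}
    except OSError as exc:
        # Divergence: report it rather than guess. The operator decides
        # whether to change the policy (force it) or fix this one machine.
        return {"path": path, "changed": False, "ok": False, "error": str(exc)}
```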
The other option is to only ever start with a blank slate for each machine and build it from scratch into a known state. If anything diverges, scrap it and start over. This is an acceptable method of attack to keep systems from diverging, but not always the pragmatic one.