Or even instead of any kind of rm command; mv is far less drastic, since nothing is actually destroyed. I tend to prefer
`mv x $(date +%Y%m%d_%s_)x`
where:
%Y - 4-digit year
%m - 2-digit month
%d - 2-digit day
_ - literal underscore
%s - Unix timestamp (seconds since the epoch)
This ensures that the versions you're 'removing' sort lexically in chronological order, which is easy to interpret at a glance, and the epoch seconds mean it still works if you need to try more than once in a day.
In case it's not apparent to some, this command moves the directory (or file) called 'x' to something like '20170131_1485916040_x'.
Then when you're all done (i.e. production is humming and passing tests, no need to ever rush), you can delete the timestamped version, or if space is plentiful, move it to an archive directory as extra redundancy (an extra backup, not a substitute for a more thorough backup policy).
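If you do this a lot, a tiny wrapper keeps the habit down to one command instead of a date format you have to remember under pressure. Just a sketch; the function name is mine:

```
soft_rm() {
    # "Remove" by renaming in place to <YYYYMMDD>_<epoch>_<name>.
    local target
    for target in "$@"; do
        mv -v -- "$target" \
            "$(dirname -- "$target")/$(date +%Y%m%d_%s_)$(basename -- "$target")"
    done
}

# soft_rm x    ->    x becomes e.g. 20170131_1485916040_x
```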
This needs to be the first thing anyone who works with stateful systems learns. NEVER rm. Even a plain mv to a fixed .bak name is insufficient; mv dir dir.bak.`date +%s` has prevented data loss for me several times.
I agree with what you're saying, and this is almost exactly what I do, but when disk space is limited - particularly during time-sensitive situations - this advice isn't very useful. For example, if a host or service is on the cusp of crashing because a partition is quickly filling up with rolling logs, what do you do? mv doesn't actually free any space.
At some level you have to run an rm, and you'd better hope you get it right in the middle of an emergency with people breathing down your neck.
In an ideal world, this wouldn't ever happen, but it does. Inherited/legacy systems suck.
You shouldn't let yourself get to that point. Your alerting system should warn you when disk usage hits 70% or some similarly generous threshold, with plenty of margin left. If it's not set up that way, stop what you're doing and go fix that. (Seriously.) If your systems routinely sit at 90% disk usage, give them more disk (or rotate logs sooner).
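A minimal cron-able check along those lines; the threshold, partition, and alert command here are placeholders for whatever you actually use:

```
#!/usr/bin/env bash
# Warn when a partition crosses a usage threshold.
THRESHOLD=70
PART=/var

usage=$(df --output=pcent "$PART" | tail -1 | tr -dc '0-9')
if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "Disk usage on $PART is ${usage}% (>= ${THRESHOLD}%)" \
        | mail -s "disk warning: $(hostname) $PART" oncall@example.com
fi
```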
And even assuming all of that fails and I'm in a situation where I have seconds until the disk hits 100%, I would much rather the service crash than make a mistake and delete something critical.
If someone is breathing down your neck, you can even enlist them to co-pilot what you're doing. Even if they're not technical enough to understand, talking them through what you're about to do will help you spot mistakes.
If you lose a shoe while running across a highway, it's probably not worth the risk trying to get it back.
The disk usage is just one example of many. The actions you take really depend on what field you're in, and again, legacy/inherited systems are completely filled with this sort of shit. You can say that I shouldn't let it get to that point, but that kinda dismisses the point I'm trying to make: shit can and will happen, and you need to know how to deal with it on your toes. There are times when you have to do things that would make most people flip their shit. There are ways to mitigate the risk in emergency scenarios, as you say, but when the risk is actually worth it, you tend to do Bad Things because there's no other option.
In my case, it was in HFT, where I inherited the infrastructure from a JS developer, who inherited it from a devops engineer, who inherited it from a linux engineer, who inherited it from another linux engineer. It was a complete shitshow that I was dropped into mostly on my own with little warning. To make matters worse, each maintenance window was 45 minutes at 4:15pm, plus weekends. Even worse, if a server went down at 5:00pm, the company immediately lost about 35k - and the same was true if the trading software went down. When I asked for additional hardware to do testing on, I was told there wasn't a budget for it. The saving grace was that there were 23:15h of downtime each day to plan in, so an `rm -rf /` would have had nearly the same long-term impact as a `kill -9` on the application server.
Mind you, the owners of the company were some of the smartest and most technical folk I've ever worked with, and they were surprisingly trusting of my ability to manage the infrastructure. The company no longer exists, and not without reason.
Just to show the lunacy of the infrastructure: they had their DNS servers hosted on VMs that required DNS to start. About a month after I joined, we had a power failure. You can imagine how that went.
(That all said, it was the greatest learning experience I've ever had. Burned me out a tad, though.)
I don't mean to sound flippant, but if I walked into the situation you describe, I would immediately walk right back out. That is a situation set up from the start for failure, and there's no way I would care to be responsible for it.
I certainly get that we inherit less-than-ideal systems from time to time; I've been there. But I've also learned that every time I get paged in the middle of the night, it's my failure, whether for a lack of an early-warning system, or for doing a bad up-front job of building self-healing into my systems. If I inherit a system that I can tell is going to wake me up at night, I refuse to be responsible for it in an on-call capacity until I've mitigated those problems.
There seems to be this weird thing in the dev/ops world where it's somehow courageous to be woken up at 3am to heroically fix the system and save the company. I've been that guy, and I'm sick of it. It's not heroic: it's a sign of a lack of professionalism leading up to that point. Make your systems more reliable, and make them continue to chug along in the face of failure, without human interaction. If you have management that doesn't support that approach, make them support it, or walk out. Developers and operations people are in high enough demand right now in most markets that there will be another company that would love to have you, hopefully with more respect for your off-duty time.
I once tried to help a company with similar infrastructure insanity recover from a massive failure. Absolutely brutal.
When my team finally got services up and running (barely) after ~18 hours of non-stop work, the CTO demanded that we not go home and get some sleep until everything was exactly as it had been before the failure.
Unfortunately the answer is "it depends on the application". I tend to run stuff with even higher margins: I never expect more than 30-40% disk utilization. Yes, it's more expensive, but I value my (and my colleagues') sleep more.
But it's all just about measurement. Run your application with a production workload and see how much log volume it generates over defined time intervals. Either add disk or reduce logging volume until you're happy with your margins. (Logging is often overlooked as something you need to design, just like you design the rest of your application.)
Log rotation should be a combination of size- and time-based. You probably want to keep only X days of logs in general, but also put a cap on size. If you're on the JVM, logback, for example, lets you do this: if you tell it "keep the last 14 log files and cap each log file at 250MB", then you know the max disk usage for logging is bounded (roughly 14 x 250MB = 3.5GB, plus the active file).
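For services that just write plain log files, a logrotate policy gives you the same kind of bound; a rough sketch, with the path and numbers made up for illustration:

```
# Rotate daily, but also whenever a file exceeds 250MB; keep 14 rotations.
sudo tee /etc/logrotate.d/myapp > /dev/null <<'EOF'
/var/log/myapp/*.log {
    daily
    rotate 14
    maxsize 250M
    compress
    missingok
    notifempty
    copytruncate
}
EOF
```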
If you can do it, use an asynchronous logging library that can fail without causing the application to fail. If your app is all CPU and network I/O, there's no reason why it needs disk space to function properly. If you can afford it, use some form of log aggregation that ships logs off-host. Yes, you've in some ways just moved the problem elsewhere, but it's easier to solve that once in one service (your log aggregation system) than in every individual service.
If your app does require disk space to function properly, then of course it's a bit harder, and protecting against disk-full failures will require you to have intimate knowledge of what it needs disk for, and what the write patterns are.
It's never going to be perfect. Just as 100% uptime is, over a long enough time scale, unachievable, you're never going to eliminate every single thing that can get you paged in the middle of the night. But if you can reduce it to that one-in-a-million event, your time on-call can really be peaceful. And when you do get that page, look really hard at why you got paged, and see what you can do to ensure that particular thing doesn't require human intervention to fix in the future. You may decide the cost of doing so isn't worth the time, and that getting woken up once every X days/weeks/months/whatever is fine. But make that your choice; don't leave it up to chance.
I'm curious, too. Proper disk space management and monitoring is probably the most difficult problem I know of in the ops field. I haven't seen anybody do it in a way that prevents 3am wakeup calls without a 24/7 ops team.
For example, a 3am network blip that causes the application server (still logging in DEBUG from the last outage) to fill up its log partition while it can't communicate to some service nobody monitors anymore. Not sure how you'd solve that one.
> For example, a 3am network blip that causes the application server (still logging in DEBUG from the last outage)
Nope. Don't do that. Infra should be immutable. If you need to bring up a debug instance to gather data, that's fine, but shut it down when you're done. If you don't, and it causes an issue, you know who to blame for that.
> to fill up its log partition while it can't communicate to some service nobody monitors anymore.
Sane log rotation policies (both time- and size-based) solve this. If you tell your logging system "keep 14 old log files and never let any single log file grow above 250MB", then you know the upper bound on the space your application will ever use for logging.
Also, why are you not monitoring logs on this service? If it's spewing "ERROR Can't talk to service foo" into its log file, why aren't you being alerted on that well before the disk fills up?
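Even something crude gets you most of the way there. A hypothetical check for a systemd-managed service (the service name and alert command are mine):

```
# Page if the service logged any ERROR lines in the last five minutes.
errors=$(journalctl -u myapp --since "5 minutes ago" --no-pager | grep -c "ERROR")
if [ "$errors" -gt 0 ]; then
    echo "myapp logged $errors ERROR lines in the last 5 minutes" \
        | mail -s "myapp errors on $(hostname)" oncall@example.com
fi
```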
> ... nobody monitors anymore.
Nope. Not allowed. Fix that problem too. Unmonitored services aren't allowed in the production environment, ever.
I've heard (and given) all the excuses for this, but no, stop that. You're a professional. Do things professionally. When management tells you to skimp on monitoring and failure handling in order to meet a ship date, you push back. If they override you, you refuse on-call duty for that service. (Or you just ignore their override and do your job properly anyway.) If they threaten to fire you, you quit and find a company that has respect for your off-duty time. Good devs & ops people are in high enough demand these days that you shouldn't be unemployed for long.
We switched over to centralized logging two years ago. All hosts are configured to keep only small logfiles and roll them over every few megabytes. Filling up the centralized logging is nearly impossible when monitoring is done well and disk usage never goes above 50%.
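On the host side that can be as simple as a syslog forwarding rule; a sketch assuming rsyslog, with the log host name as a placeholder:

```
# Forward everything to the central log host over TCP (@@ = TCP, @ = UDP)
# and let the central side handle retention.
sudo tee /etc/rsyslog.d/90-forward.conf > /dev/null <<'EOF'
*.* @@loghost.example.com:514
EOF
sudo systemctl restart rsyslog
```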
Btw, using an orchestration platform simplifies a lot of those "one node is going rogue and I'm about to do something stupid by accident" situations.
Monitor the rate at which the disk is filling up, and extrapolate when it will hit 90%. If that time falls outside business hours, alert early; if the current time is outside business hours, hold the alert until morning if possible.
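A back-of-the-envelope version of that extrapolation (partition, threshold, and sampling interval are made up; a real monitoring system would track this continuously):

```
#!/usr/bin/env bash
# Sample disk usage twice, compute the fill rate, and estimate time to 90%.
PART=/var
THRESHOLD=90
INTERVAL=300   # seconds between samples

used_pct() { df --output=pcent "$PART" | tail -1 | tr -dc '0-9'; }

first=$(used_pct)
sleep "$INTERVAL"
second=$(used_pct)

growth=$(( second - first ))   # percentage points per interval (coarse!)
if [ "$growth" -gt 0 ]; then
    eta_min=$(( (THRESHOLD - second) * INTERVAL / growth / 60 ))
    echo "$PART at ${second}%, projected to hit ${THRESHOLD}% in ~${eta_min} minutes"
else
    echo "$PART at ${second}%, not growing measurably right now"
fi
```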
How does this help in situations where something rogue starts filling the disk? The idea makes sense in theory, but in practice it doesn't work out that well. Ops work is significantly harder than many devs think.
> Ops work is significantly harder than many devs think
No, it's not (I've done both). Ops is about process, and risk analysis and mitigation. Yes, there's always the possibility that something can go rogue and start filling your disk. That shouldn't be remotely common, though, if you've built your systems properly.