This article needs a giant warning on it. Automating is brilliant, but an automated process is a liability. If no one is regularly checking the output of something that's been automated then you can't know if it's broken, and that could be catastrophic. Every story of "the backups had failed", "payments had been missed for months", or "inventory wasn't where it was supposed to be" is the story of an automated process no one was checking.
If you automate anything you need robust error reporting (which is not an email someone will ignore).
Every developer should take this advice to heart. If a customer asks for 'a daily email report', you should strongly advise against it.
- Most reports the customer asks for will be meaningless vanity metrics anyway
- If you only do error reporting (not sending an email if everything goes smoothly), you won't notice when email stops being delivered.
- If you report everything (not just the errors), people will stop reading them
- A mailbox history is not a log
- You'll get a request just about every week to change the recipient list of the report.
- Who reads the reports on weekends?
I could go on with arguments for a while, but the short story is this: email is not suitable for logging and reporting. In fact: email is not suitable for a lot of things. But customers will ask you anyway, because that's the tool they know (if you have a hammer, all problems will look like nails).
My advice:
- Log everything to disk when possible (syslog, pipe)
- Use a logging aggregator to filter and archive (Fluentd, ELK, Graylog)
- Use an exception tracker for exceptions (Sentry, Raygun)
- Make incidents actionable by creating tickets automatically (Jira, Zendesk, Slack)
- If needed, use an incident response service (PagerDuty, Opsgenie)
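A rough sketch of the first item, using only the Python standard library. It writes one JSON object per line so that an aggregator could parse the output later. The names `JsonLineFormatter`, `make_job_logger` and `run_job` are made up for illustration, and the field names are an assumption, not any aggregator's required schema.

```python
import json
import logging
import sys


class JsonLineFormatter(logging.Formatter):
    """Render each record as a single JSON line (field names are illustrative)."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "job": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the full traceback so the failure is diagnosable later.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def make_job_logger(job_name, stream=sys.stderr):
    """Build a logger for one job that writes JSON lines to the given stream."""
    logger = logging.getLogger(job_name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def run_job(logger, task):
    """Run task and log success as well as failure; return True on success."""
    try:
        result = task()
    except Exception:
        logger.exception("job failed")
        return False
    logger.info("job succeeded: %s", result)
    return True
```

Logging the success case as well is deliberate: a log that only contains failures looks the same whether the job is healthy or has stopped running.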
>make incidents actionable by creating tickets automatically
Not really a serious concern, but what happens when ticket automation goes down? If all problems are auto-reported as tickets or tasks and that pipeline breaks, it might take a while before anyone notices, especially if reported exceptions are rare. If it is normal to get tens of tickets every day, a problem with ticket creation will be noticed almost instantly. If tickets are very rare, not so much.
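One common way around this is a heartbeat: each automated source checks in on a schedule (for example by filing a synthetic test ticket), and a separate watcher complains when a source has been silent for too long. A minimal sketch of the watcher side, where `stale_sources` and the source names are hypothetical:

```python
from datetime import datetime, timedelta, timezone


def stale_sources(last_seen, max_silence, now=None):
    """Return the names of sources that have been silent longer than allowed.

    last_seen:   mapping of source name -> datetime of its last heartbeat
    max_silence: mapping of source name -> timedelta it may stay quiet
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, seen in last_seen.items()
        if now - seen > max_silence[name]
    )
```

For instance, a ticket bridge that should check in daily and was last seen 30 hours ago would be returned, while a backup job seen 2 hours ago would not. The watcher itself should of course run somewhere that does not share a failure mode with the thing it watches.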
The second issue with automated exception tracking is that you lose the "huh, this is weird" mechanism that works when actual humans go through logs or reports. A tool will of course be orders of magnitude faster and probably more accurate, but by relying solely on such automation you might miss the chance to notice "weird", atypical entries or rare, unexpected sequences of them. Then again, in most cases (I guess) simple statistical analysis might be a good substitute. And that can be automated.
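As an illustration of what "simple statistical analysis" could mean here, this sketch flags a daily count (errors, tickets, log lines) that sits far from its recent history, using a z-score. The function name, the threshold of 3 and the sample data are assumptions. It only catches unusual volumes, not unusual content, so it is a partial substitute at best.

```python
from statistics import mean, stdev


def is_unusual(history, today, threshold=3.0):
    """Return True if today's count is more than `threshold` sample
    standard deviations away from the mean of the historical counts."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is perfectly flat, so any change at all is notable.
        return today != mu
    return abs(today - mu) / sigma > threshold


daily_error_counts = [10, 12, 11, 9, 13, 10, 11]  # made-up sample data
print(is_unusual(daily_error_counts, 11))
print(is_unusual(daily_error_counts, 40))
print(is_unusual(daily_error_counts, 0))
```

Checking both directions matters: a sudden drop to zero is often the more interesting signal, because it can mean the reporting itself has stopped.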
(edit: formatting)
Don't let the perfect be the enemy of the good. 99.9XXXX% reliability is good enough. Eventually you have enough nines that your remaining risks are things like "nuclear war", "dinosaur-killer sized asteroid hitting the earth", et cetera.
Agreed, there is absolutely no point going further after reaching a certain reliability level. However, one thing is eliminating risks, the other is limiting the consequences of said risks. I strongly prefer 99.9% reliability where that 0.1% means some insignificant problem over 99.99% reliability where the remaining 0.01% means total disaster.
My point is that doing "too much" automation gives diminishing returns (which is not bad in itself), but might also disproportionately increase the consequences of that 0.xxxx1%.
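To put rough numbers on that trade-off: expected loss is the failure probability times the cost of a failure, so an extra nine does not help if the remaining failures are far more expensive. A tiny sketch, with all figures invented purely for illustration:

```python
def expected_annual_loss(runs_per_year, reliability, cost_per_failure):
    """Expected yearly cost of failures: runs * P(failure) * cost of one failure."""
    return runs_per_year * (1.0 - reliability) * cost_per_failure


# Hypothetical daily job, 99.9% reliable, each failure is a minor annoyance.
minor = expected_annual_loss(365, 0.999, 100)

# Hypothetical daily job, 99.99% reliable, each failure is a disaster.
disaster = expected_annual_loss(365, 0.9999, 1_000_000)

print(round(minor, 2), round(disaster, 2))
```

With these made-up inputs, the more reliable system is the more expensive one by three orders of magnitude, which is the argument for limiting consequences rather than only chasing nines.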
<cynical-response>
So you are telling me that by adding a simple email tool, I can make the customer happy because I'm giving them what they want, and not have to set up 5 tools and keep paying for them monthly?
</cynical-response>
It's your job to advise them to use the proper solution.
But the customer ultimately pays you, so if they really want an email monitoring solution, you should build it for them.
Also, the cost savings by automating the task should outweigh the monthly cost of the tools that are required to run it. If not, the task is probably not worth automating.
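That rule of thumb is easy to write down as arithmetic. The sketch below also subtracts the time spent maintaining the automation, which is an addition of mine rather than part of the comment above. The function name and all numbers are hypothetical.

```python
def net_monthly_benefit(hours_saved_per_month, hourly_rate,
                        tool_cost_per_month, maintenance_hours_per_month=0):
    """Labour saved minus tooling and maintenance cost, per month.

    A result at or below zero suggests the task is not worth automating
    on cost grounds alone.
    """
    saved = hours_saved_per_month * hourly_rate
    maintenance = maintenance_hours_per_month * hourly_rate
    return saved - tool_cost_per_month - maintenance


# Made-up example: 10 hours saved at 50/hour, 200/month of tools,
# 2 hours/month spent keeping the automation alive.
print(net_monthly_benefit(10, 50, 200, 2))  # 200
```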
We may have plateaued by now, but so far it's still a net gain - a new person comes in, learns things, often by reverse engineering, often improves the automated process, sometimes leaves etc.
But what we do need is a better way to measure automation gains and losses - basically any process that is automated shows up as a net loss in human productivity, which is simply not true.
And vice versa any automated processes probably don't account correctly for the total energy and resource use and the cost of maintenance (both software and hardware).
Automating doesn't mean "do half the job". Just as running the backup script/routine manually means checking the output and making sure the backup is usable, automating that task means performing the same checks. Sending an e-mail automatically on failure shouldn't be the issue...
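A sketch of what "performing the same checks" might look like for a backup file: it exists, it is not empty, it is recent, and it matches a checksum recorded when it was written. `verify_backup` and its parameters are hypothetical. A real check would also attempt a test restore, since none of this proves the backup is actually usable.

```python
import hashlib
import os
import time


def verify_backup(path, expected_sha256, max_age_seconds, now=None):
    """Return a list of problems found; an empty list means all checks passed."""
    if not os.path.isfile(path):
        return [f"missing: {path}"]

    problems = []
    now = time.time() if now is None else now

    if os.path.getsize(path) == 0:
        problems.append("backup is empty")
    if now - os.path.getmtime(path) > max_age_seconds:
        problems.append("backup is older than allowed")

    # Hash in chunks so large backups do not have to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        problems.append("checksum mismatch")

    return problems
```

Returning a list of problems rather than raising on the first one means a single run reports everything that is wrong, which makes the resulting ticket or alert more useful.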
We'd like to think that a process being manual means someone is paying attention. That's not always the case. You still need to be sure that checking is actually happening, whether the process is automated or manual.
Proper governance mechanisms are the way to go. They are one of the core aspects of the Monetary Authority of Singapore's FEAT principles, and I fully believe that makes sense.