> 18. Do automated processes that generate email only do so when they have something to say?
> for question 18: how do you know when your notification system goes down?
You monitor your monitoring system. I think 18 is important: noise from automated processes hides real problems. I worked with a manager who would consistently write cron jobs that ran as root (17) and sent out useless emails every day (18). One of the cron jobs sent 500KB - 10MB of text every day. No one will read 10MB of text, so if there is an error, no one will see it. Write your scripts correctly: use --quiet flags and redirect stdout to /dev/null, so that only errors on stderr produce mail.
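For jobs that have no --quiet flag, one option is a small wrapper that swallows output on success and only speaks up on failure. This is a minimal sketch, not anyone's production tool: the names `run_quietly` and `main` are made up for illustration, and it assumes cron is configured to mail whatever the job prints.

```python
import subprocess
import sys


def run_quietly(cmd):
    """Run cmd and return (exit_code, report).

    report is an empty string on success, so cron has nothing to mail.
    On failure it contains the exit code, the command, and everything
    the command wrote to stdout and stderr.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
    )
    if result.returncode == 0:
        return 0, ""
    report = "command failed with exit code {}: {}\n{}".format(
        result.returncode, " ".join(cmd), result.stdout
    )
    return result.returncode, report


def main(argv):
    """Hypothetical entry point: main(["backup.sh", "--all"])."""
    code, report = run_quietly(argv)
    if report:
        sys.stderr.write(report)
    return code
```

To use it from a crontab you would end the file with `sys.exit(main(sys.argv[1:]))` and prefix the real command with the wrapper. The point is the same as above: mail arrives only when something went wrong, so it gets read.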
The monitoring servers monitor each other (and themselves), and they should be in different data centers. You can also use a third-party service to monitor parts of your infrastructure, including the monitoring server. Depending on your needs, a simple service like Pingdom could be enough.
If you are wondering how to monitor for both data centers going down at the same time, I'd say most companies shouldn't worry about it. 1) The odds are extremely low. 2) You will notice if two DCs go down. 3) That nightly email that says "I'm up" isn't going to help here. 4) Even the free version of Pingdom will alert you when your whole data center is down.
There are all sorts of other things to consider with redundant monitoring, but that's the job of a sysadmin -- identifying failure points, assessing risk, etc.
I'm not sure how email would help. I guess you mean the system would email you once a day saying that the monitoring system is working, and if you don't see the email you know to look into it. The monitoring system I use has a much tighter SLA than 24 hours, so a once-a-day email would tell me too late.
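If you do want a heartbeat, it probably makes more sense to have a machine notice the missing one than to rely on a human spotting an absent email. A rough sketch of that check, assuming the monitor records a timestamp somewhere on each successful run (the function name and the five-minute default are invented for the example):

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_stale(last_seen, now, max_age=timedelta(minutes=5)):
    """Return True when the last heartbeat is older than max_age."""
    return now - last_seen > max_age


# Example: a heartbeat last seen ten minutes ago counts as stale.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = heartbeat_is_stale(now - timedelta(minutes=10), now)
```

Something outside the monitoring box has to run this, otherwise the check dies along with the thing it is watching.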
Usually folks divide the monitoring work between two servers and each server monitors the other. Or, you "meta monitor"... a monitoring system that just monitors the monitoring system. Then you get a third party to monitor that. Then it is turtles all the way down.
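The "each server monitors the other" part can be sketched in a few lines. This assumes every monitor exposes some HTTP health URL, which is my assumption rather than anything stated in the thread; `http_probe`, `check_peers`, and the URLs in the comments are hypothetical.

```python
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if url answers with an HTTP 2xx within timeout seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a 4xx/5xx reply.
        return False


def check_peers(peers, probe=http_probe):
    """Return the names of peers whose health check failed.

    peers maps a monitor name to its (hypothetical) health URL, e.g.
    {"mon-dc2": "http://mon-dc2.example.com/health"}.
    """
    return [name for name, url in peers.items() if not probe(url)]
```

Each monitor runs `check_peers` against the others and alerts only when the returned list is non-empty, which keeps it in line with question 18: say nothing when there is nothing to say.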
It sounds like both of you are talking about cloud stuff when thinking about datacenters. I can see how that many layers and cross-checks would be necessary when all you really control is running memory and some pieces of storage, but a lot of that is due to the platform. When you control the actual metal, third-party monitoring services are much less necessary.
For a real DC, when it goes down I get a phone call from a human. I don't have to reinvent that process. If it's my own server room, I use a landline and a modem for OOB "dude you gotta come down here" notifications, a WAV of Woody Woodpecker or something. If the phone lines are down, I look at the newspaper headlines to see what happened.
There's no reason not to set up a standalone monitoring regime. Whether or not you use heartbeat notifications to tell you all is well is a matter of taste, but there is definitely more to maintaining your nines on a daily basis than simply adding more layers of monitoring.