If you log everything, you will probably end up with too many log entries to react meaningfully to the exceptions in your log.
In theory, this makes perfect sense. In reality, you will almost always wish you had logged more. Having good detailed data will enable you to discover patterns when you can't recreate problems.
Disk space is cheap. Log it and archive or dispose of it later. You can't see what you don't save.
Generally, you're right. To know what will be helpful for debugging, you have to know what will go wrong, and if you know what will go wrong, you may as well fix it now instead of after it affects a user. Also, in any sufficiently stable system errors become very hard to reproduce, so you may not have the luxury of reproducing the error before finding the root cause of the problem.
Large log files don't scare me for I have The Shield of Grep, and The Sword of Awk.
The only issue I've seen with logging everything is that some file systems have trouble with files larger than 2 GB, and changing file systems may not be an option. Depending on how quickly you rotate your logs, you can hit that limit and stop logging messages or cause errors elsewhere. I've hit this limit even with a log rotation period of one hour, and it's rather annoying.
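If the logging library supports it, rotating by size as well as by time sidesteps that limit entirely. A rough sketch in Python with the standard library's RotatingFileHandler (the 512 MB threshold and backup count are just illustrative):

    import logging
    from logging.handlers import RotatingFileHandler

    # Roll over by size rather than only by time, so no single file
    # can grow anywhere near a 2 GB file-system limit.
    handler = RotatingFileHandler(
        "app.log",
        maxBytes=512 * 1024 * 1024,  # roll over at 512 MB
        backupCount=48,              # keep the last 48 rolled-over files
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )

    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)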
A better idea may be not to log to a file at all. If you were to log to a database, you could record things like date/time, session_id, service_name, module_name, and a generic message, then run clever SELECTs over your table of events/messages. It would be much easier to answer questions like 'Is this exception clustered in time?' or 'Is module X ever called in a session that results in an error between noon and 1pm?'
Cleaning up old messages is potentially easier too, as you can remove lower-priority messages first, or simply delete any message older than X.
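For what it's worth, a rough sketch of what that could look like with SQLite from Python; the schema, the priority convention, and names like 'module_x' and 'SomeException' are all made up for illustration:

    import sqlite3

    conn = sqlite3.connect("events.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS log_events (
            logged_at    TEXT,     -- ISO 8601 timestamp
            session_id   TEXT,
            service_name TEXT,
            module_name  TEXT,
            priority     INTEGER,  -- higher number = more severe
            message      TEXT
        )
    """)

    # 'Is this exception clustered in time?' -- count occurrences per hour.
    clustered = conn.execute("""
        SELECT strftime('%Y-%m-%d %H:00', logged_at) AS hour, COUNT(*) AS n
        FROM log_events
        WHERE message LIKE '%SomeException%'
        GROUP BY hour
        ORDER BY hour
    """).fetchall()

    # 'Is module X ever called in a session that errors between noon and 1pm?'
    sessions = conn.execute("""
        SELECT DISTINCT a.session_id
        FROM log_events a
        JOIN log_events b ON a.session_id = b.session_id
        WHERE a.module_name = 'module_x'
          AND b.priority >= 40                        -- treat >= 40 as an error
          AND time(b.logged_at) BETWEEN '12:00' AND '13:00'
    """).fetchall()

    # Cleanup: drop low-priority rows first, then anything older than 90 days.
    conn.execute("DELETE FROM log_events WHERE priority < 20 AND logged_at < date('now', '-7 days')")
    conn.execute("DELETE FROM log_events WHERE logged_at < date('now', '-90 days')")
    conn.commit()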
Of course, if you're logging to another remote service (like a DB server), how do you record low-level events in communicating with that service, or how do you record events when the DB server is down?
No real reason except that it works the way it does. It would take time to change it, and there is more pressing work to be done. That, and I don't work for the company that had/has this problem anymore.
Another option to consider when encountering a fatal error is to save the process image to disk. In Lisp, you can fork, start Swank, and save-lisp-and-die; in other languages, you can fork and cryopid yourself. Then you can start your program back up in the exact state that caused the error and gather the information you need to fix the problem.
I have not had the occasion to try this yet, but I know other people have found this approach to be useful. Fatal errors shouldn't be all that common, and this will give you the information you need to really fix the problem.
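You don't get a full process image everywhere, but in Python, for example, you can at least capture the stack and every frame's locals at the moment of an unhandled exception. A minimal sketch (the dump file naming and format are arbitrary), as a lighter-weight cousin of the fork-and-save approach described above:

    import sys
    import traceback
    from datetime import datetime

    def dump_state_excepthook(exc_type, exc_value, tb):
        """On an unhandled exception, write the stack and each frame's locals to disk."""
        path = datetime.now().strftime("crash-%Y%m%d-%H%M%S.dump")
        with open(path, "w") as f:
            f.write("".join(traceback.format_exception(exc_type, exc_value, tb)))
            frame_tb = tb
            while frame_tb is not None:
                frame = frame_tb.tb_frame
                f.write(f"\nLocals in {frame.f_code.co_name} "
                        f"({frame.f_code.co_filename}:{frame_tb.tb_lineno}):\n")
                for name, value in frame.f_locals.items():
                    try:
                        f.write(f"  {name} = {value!r}\n")
                    except Exception:
                        f.write(f"  {name} = <unprintable>\n")
                frame_tb = frame_tb.tb_next
        # Fall back to the default handler so the crash is still visible.
        sys.__excepthook__(exc_type, exc_value, tb)

    sys.excepthook = dump_state_excepthook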
Nice to know about this approach, but I'm more concerned about non-fatal errors. They are the vast majority and they continue to corrupt your data for months, even years. These are the ones that are so hard to find and reproduce.
I've always thought that if something was wrong, I'd rather have it crash and burn than continue to run unnoticed.
Decent article overall (fairly short and not particularly controversial though).
One item I would argue with is the second half of this:
> You need to decide what critical means for you (most often it means losing money).
Just because you're losing money doesn't make it critical, at least not for any site that operates at scale. If you log as critical every time you know you're losing a small amount of money, you'll also have log files full of the tiny taxes you've paid. "Can't reach US credit card payment provider" probably sounds critical, but it isn't all that critical if you can place that order into a stored state and retry it with the provider later. Sure, that loses money because some of those payments will fail, and some users won't retry after an email prompt, but that doesn't make it critical.
In my experience (on the dev side for years and now on the operations side), developers log errors at one level more severe than is actually warranted. ("Divide by zero" is almost never a fatal error from the perspective of the site operations team. :) )
Log everything, and then use rules to filter out the information you know won't be interesting. The remainder is what's interesting (if it isn't, add to your rules to filter it out).
This way, if later on you're trying to track something that requires information you usually don't find interesting, you just update your rules and you have all you need.
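In practice that can be as simple as keeping the raw log complete and applying the rules only when you read it, so updating a rule never costs you data. A small Python sketch (the patterns are just placeholders for whatever noise you already know about):

    import re
    import sys

    # Lines we already know are uninteresting; extend this list as new noise shows up.
    BORING = [
        re.compile(r"heartbeat ok"),
        re.compile(r"GET /healthz .* 200"),
        re.compile(r"cache hit for key"),
    ]

    def interesting_lines(path):
        """Yield every log line that no rule filters out; the raw file still keeps everything."""
        with open(path) as f:
            for line in f:
                if not any(rule.search(line) for rule in BORING):
                    yield line

    if __name__ == "__main__":
        for line in interesting_lines(sys.argv[1]):
            sys.stdout.write(line)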