RFC3339 gives a more tightly defined subset of ISO8601, which is even more useful and lacking things like "week of year" which have little use on modern systems:
More like the author was lazy and not thinking about correct ordering of the date there. It should absolutely be yyyy-mm-dd. I'll be updating the post w/ feedback from the comments sometime today or tomorrow. So expect to see that change.
It uses thread-local storage and RPC library instrumentation to thread a fixed request-id through every RPC call and thread handling a particular user request.
Then you can do things like automatically adding the request ID to log messages, transparently to the request-handling code. And when you visualize logged RPCs, you can include (some subset of) the logged messages in your visualization.
One of the hardest things in ops is trying to see what the app is doing without benefit of the code to review.
I know what the app does, I know the process flows involved but 'failed to covert X at MyClass:44' doesn't have enough context for me.
Also log levels - most developers over use warning and error levels and under uses debug and trace in my experience. It makes filtering out the noise harder.
Rolling log files - roll on an event (new day, process bounced etc.) not on reaching a 10Mb limit or an equally arbitrary 1k lines limit.
Use a header in the logfile, contain at least the version number of the code running, the user and host it is running on.
Lastly stack traces, if they even have a use case at all in production, should not be in the main logfile.
Lastly stack traces, if they even have a use case at all in production, should not be in the main logfile.
I would have to disagree with this one. There's no way to generate the stack trace after the fact if a particular bug is extremely rare; having a complete call chain makes fixing the problem much easier.
Someone once said that removing error handling once the code is working is like removing the landing gear once a plane is airborne.
roll on an event (new day, process bounced etc.) not on reaching a 10Mb limit
Only applicable when you produce a small, predictable volume of logging. Not a good general guideline because often you don't know how big your system may grow in the future.
When logging high volumes then fixed chunk sizes are usually preferable (e.g. so you know how many you can keep around without a disk running full). Large sites produce multiple gigabytes of logs per hour.
Use a header in the logfile
Disagree. Context should be provided directly in each individual line. Logs get sliced and diced in analysis, fetching an arbitrary header from the right file for a given context can be rather painful.
Don't be afraid of long lines. Use field delimited formats (TSV) so they can be easily parsed (either via 'cut' or a small script).
stack traces ... should not be in the main logfile
Wrong. When a serious error occurs then as much context as feasible must be provided. Stack traces are usually very feasible and indispensable.
>> Only applicable when you produce a small, predictable volume of logging.
No, It's primarily in the larger log volume systems we run into this problem. Take a FIX session adapter (very common software in the financial services world), baseline is around 5Gb/hour but its susceptible to fairly extreme fluctuation.
When you roll the logs at say 500Mb, there's the chance our monitoring system misses an error in the tail of the previous log as the logs roll (which could in the worst cases mean loosing a couple of million or even more).
Multiple logfiles for a day are harder to search through when handling the most common use case for log files, which is not debugging but ops staff answering questions on 'what happened with X?'
What benefits does a split logfile provide? The one use case is transferring logs across the network. However most log management systems already split large logs up before transferring to the archive silo.
You're loosing a lot, but gaining nothing with split logs.
>> When logging high volumes
It makes no difference, when the market swings you're going to need an extra 200Gb anyway, whether the logs are in chunks or not. In large volume apps deleting to make space is often not an option - you're logging in high volumes for a reason (often regulatory).
>> fetching an arbitrary header from the right file for a given context can be rather painful.
That's been a solved problem for a very long time. Just as recombining a separate trace log with a main app log based on time stamp is a solved problem. I believe this highlights the bigger problem, you can go to university or a.n. other school and learn to be a developer. Where can you go to learn operations? The net result is people come up with the own solutions each time instead of building on work already done. That work is available but its often passed on by word of mouth. Surprisingly hard even in today's world to find decent blogs on ops work (thankfully there are more around now).
>> When a serious error occurs then as much context as feasible must be provided
Yes this is the extraordinarily rare case where stack traces can be useful. They still don't belong in the main app log though.
Push to a separate trace log and recombine on the fly when you need to view. The benefits are to the ops team and the monitoring system.
Another approach not often mentioned is black box logging. Log to a circular buffer in memory, the first bytes of the buffer are set to a pattern that can be searched for in a core dump, the remainder is used as a circular buffer. It's fast to log to, takes no disk space and provides crucial context in the event of a crash.
Im not sure I'll be able to convince you based on your tone, but speaking from experience, you are making the most common mistakes.
When you roll the logs at say 500Mb, there's the chance our monitoring system misses an error
Sorry but then your monitoring system is broken?
Why is its operation tied to log chunking?
Multiple logfiles for a day are harder to search
Not really. Often enough interesting events span multiple days (or cross midnight) anyways. Thus regardless of chunking you have to be prepared for that case.
What benefits does a split logfile provide?
I'll give you that "one file per day" seems easier at first, but I've been bitten by that too often. A script to identify the file-range for a given time-range via binary search is really easy to write, has only to be written once, and the benefits of fixed chunks are:
Easier handling. If you need to offload "50GB" from a given log-host then that's easier to do when you know that will be 50 files.
You don't risk a scratch disk running full (i.e. those small and expensive spindles that take the initial load) when the application decides to have a bad-hair day and put out 10x the normal volume for a while (e.g. interleaved with stack traces).
>>fetching an arbitrary header from the right file for a given context can be rather painful.That's been a solved problem for a very long time. Just as recombining a separate trace log with a main app log based on time stamp is a solved problem.
I said painful, not impossible. You say you're dealing with 5GB/hour logs, then you must be well aware how long even the simplest merge-job takes on the slice of a single day?
>> When a serious error occurs then as much context as feasible must be provided
Yes this is the extraordinarily rare case where stack traces can be useful.
We must be living in different worlds then.
When errors pop up in the main logfile, which happens frequently, then the last thing I want is to stitch together context from other files.
A mere grep on a day's worth of logs takes about 10 minutes for us. Any kind of text-processing or "for each line perform lookup in $otherplace" quickly pushes that into the hours.
>> Sorry but then your monitoring system is broken? Why is its operation tied to log chunking?
Yours clearly bends some unbendable truths of the known universe ;-)
A monitoring system ordinarily samples the logfile at a given point in time, that is an open followed by a close operation on the logfile (lest you run into phantom disk usage issues caused by holding an open file handle to a compressed / rotated logfile).
If between sample 1 and sample 2 the log file is rolled, your monitoring system never sees the data between sample 1 and the end of the previous log.
Possible workarounds include logfile streaming (say via a pipe on the filesystem) but that introduces much larger problems and would in no way be compatible with the app rolling its log files.
A much simpler fix is to roll only on events (bounced the app etc.)
>> A script to identify the file-range for a given time-range via binary search is really easy to write
Contrast with no script required. No bugs, no maintenance time, no different versions in different regions. Just no script. The benefits don't stop there - when you need to query, there's no waiting for a bunch of disk seeks while the script runs against a busy disk controller - you can just get straight on with parsing the results.
>> Easier handling. If you need to offload "50GB" from a given log-host then that's easier to do when you know that will be 50 files.
Moving 50 items is easier than moving 1? I'm unconvinced but still open to any interesting ideas here.
>> You don't risk a scratch disk running full
There is no change to the risk profile, neither approach is better in this regard, unfortunately.
>> e.g. interleaved with stack traces
Just one of the many excellent reasons not to log them in production.
>> you must be well aware how long even the simplest merge-job takes on the slice of a single day?
Yes - it runs at full disk I/O speed with almost no CPU overhead. Remember we are merging sorted data (already in temporal order).
>> which happens frequently
It's clearly a high profile app you're working on (in my experience the people in charge of the pennies refuse to sign off on the cost of disk space for this amount of logging without solid reasoning) i'm surprised it behaves as badly as you say. Concerning, but then it also appears you don't have an ops team either.
>> stitch together context from other files.
I get the impression you're thinking that you should manually stitch these together? On the very rare occassions you need this data, remember both are already in time order, we just instantly merge on the fly.
>> A mere grep on a day's worth of logs takes about 10 minutes for us.
It sounds like you've got it all in one file, i.e. not splitting your audit.log out from your error.log from your trace.log? - techniques which I couldn't do without to be honest (the monitoring system would have to consume every byte logged otherwise, when it really only needs to see a subset of total log output - thus saving CPU time and disk seeks).
>> Any kind of text-processing or "for each line perform lookup in $otherplace" quickly pushes that into the hours.
shakes head
Would you humour me with some context here. What is your background? From your replies i would say you are non-ops staff[1], self-taught[2] and with limited experience[3].
[1] You don't demonstrate any knowledge of basic tools, e.g. you mentioned writing a script for a problem solved by a 35+ year old command found on any unix host
[2] You haven't volunteered anything more advanced than obvious / first order approaches. Approaches that i'd expect any CompSci university grad could suggest off the cuff with no prior experience.
[3] You mention grepping through a single logfile of at least 35Gb, likely much larger.
Possible workarounds include logfile streaming (say via a pipe on the filesystem) but that introduces much larger problems and would in no way be compatible with the app rolling its log files.
What you call a workaround (streaming to a central location via syslog or scribe) happens to be the standard approach in my corner of the world. Analytics and monitoring operate naturally on the stream because, as you point out, sampling on rolling logfiles is not exactly reasonable, neither is scattering logfiles across application servers.
I'm leaving this discussion at this point because I'm not interested in your condescending tone and insults. It appears you haven't even centralized your logging, yet feel entitled to give ops-advice that is at odds with how the rest of the world operates.
you're mixing up a separate concept - streaming to a remote host vs streaming via a pipe on the filesystem
You said above, quote: but that introduces much larger problems and would in no way be compatible with the app rolling its log files.
Why would you have log-pipes, or log-files on your app-servers to begin with?
Syslog streaming has been a common approach since the ~70s i'd guess?
Then how come you're not doing it?
Hopefully, you see the irony in this.
All I'm seeing is a constant stream of arrogance that doesn't seem to be backed up. Also, the command that you wanted to helpfully point out is called "mkfifo" or "mknod".
UTC is great for timestamps, but he needs to make his times sortable (YYYY-MM-DD)
Tab delimited fields are a problem, unless you escape the tabs inside the data you log.
I'm a huge fan of JSON logs now, having just finished implementing them on a major work project. Every language out there can take JSON in and do work against it, and there are some really awesome tools for rendering filtered logs to a browser in realtime.
How do you differentiate between tabs that the logger generates, and tabs inside log messages? You need to escape the tabs inside messages. This problem exists with any character delimited format (e.g. csv).
In fact, they compress to about a 3% of their size. Since they really only exist in text file form for a limited period of time before being moved on to a separate logging system, their exploded size isn't much of a concern.
time is expressed in seconds? My computer does about 3 billion things every second. I think milliseconds or preferably microseconds would be much better. Different domains have different needs, though, I'm sure. Also, I'd prefer the log specify the time zone it thinks it's in, rather than just convert to UTC automatically.
pid - yes, knowing the process id is good, but knowing the thread is also useful.
version - outputting this on every single log line is probably overkill. Ditto for host name.
A caveat here as well is you also need to make sure your various clocks in the data center are synched. This is one of those common but overlooked scenarios which lead to hilarious discussions about latency when really there was just a clock synch problem. Additionally the fallacies of distributed computing come into play here (namely, transport cost is 0) so you need to take extra care when looking at measurements across different systems and machines. Even machines in the same data center. The pipe that you are communicating across could be oversubscribed for example.
Thanks for the feedback. I'm with you on the milliseconds and threads bits. As for hostname / version... how are you getting those into your logs? I've found having version in the logs very useful (and wished I had it when it wasn't there).
Couldn't you output the host and version at the beginning of the log? Any tool that transforms the logs (splicing together logs from different servers) should include these bits in it's output.
Also, why don't you use a sortable form for time? (With the most significant parts first.)
My issue with host and version at the beginning of the log is that they'll roll away when your logs rotate. If you've got a service that's rarely taken down, anything written to the front of the logs is likely long gone by the time you have a problem. That said, there may be a better way to handle these two. I certainly don't like repeating information if it's not necessary.
I don't think outputting the version is overkill at all. It's much easier and simpler to deal with it this way. Probably the nanosecond business is very much overkill, that only matters for a smaller subset of problem domains.
Whether the version is overkill depends on the frequency of version changes. If you're doing careful corporate releases with staging and QA on a two-month basis, it's definitely overkill - if you're doing continuous deployment with several releases daily and different systems running different versions concurrently, it's probably valuable.
Most important point: any log message that doesn't identify some property of the request that originated it (so that you can identify requests leading to erroneous conditions) is cause for immediate defenestration.
Agreed. We use mod_unique to add a header to each response and make sure it's included in each error log, back end access log and front end access log as well as being viewable via Charles/fiddler at the client. Being able to tie the various logs from different machines together has been invaluable.
This link from a talk by Jordan Sissel has some great information on logging and information that's commonly missed in log files (even in some very large products)
I personally have found a lot of difficulty with JSON-based logs, because tools like `grep` and `cut` don't suffice; I usually have to write a more complex regex with `sed` in order to extract useful information I want.
Anyone have a good utility for managing JSON in log files?
RecordStream is a great solution for manipulating JSON log files: https://github.com/benbernard/RecordStream - makes it easy to perform relational-style operations, such as grouping, counting (recs-collate), and even joins (recs-join). It can also display and transform the logs in a variety of ways (recs-tocsv, recs-totable, recs-tognuplot for graphing).
There are also tools to extract records from other data formats such as CSV or by querying a database.
All in all, a fantastic tool for manipulating and querying complex data from the command line. I use it regularly.
underscore-cli is utterly brilliant for this, even better — dare I say it? — than traditional log files with cut/grep/awk. Being able to treat logs as a dataset that can be mapped/reduced/flattened/grouped/otherwise analysed is life-changingly good.
About the strongest endorsement I can give is that it makes me wish more of my logs were in JSON format, rather than wishing that less were as I always had before.
Recs (therecordstream.com) is great for this. A JSON log file would be interpreted as its native record format, so you can use all of its processing tools (like recs-grep and recs-collate and recs-sort).
I reviewed my some of my code and figured out that I'm one of this terrible persons, which haven't spent enough time to show an useful output. I should fix this now ;) A useful log output can be a real time saver. I also seen applications which just showed "an error occurred", so I had still no idea what causes this error message.
In my experience the measure of good logging is the ability for someone, either yourself or an operations engineer to be able to easily solve problems using only the logs. They shouldn't have to refer to the source code. Developers should be using this as the yardstick when asking questions about their log output.
So how do you implement this? In your Rails apps, what kind of library do you use to write log messages in this format? Is it an open-source library or one you wrote just for your company?
Please use YYYY-MM-DD. It's sortable and more intuitive for everyone outside of the USA. See also http://en.wikipedia.org/wiki/ISO_8601