Something I wish this discussed more was development vs production logs. It talks about a unique request identifier across microservices, but this applies within a single service too.
In your development environment, you're usually only doing one request/job at a time, and so logging each step along the way makes it easy to follow and debug. In production, with tens or even thousands of concurrent requests, this quickly becomes basically impossible.
I've become a huge fan of fewer but "wider" lines. Take sending an email.. I'm sure most people have seen (or even written) logs like:
* New message from me@example.com to you@other.com
* Subject "test 123"
* Generating message using template xyz
* Checking sender against block lists
* Sender block list check passed
* Connecting to SMTP server smtp1.example.com..
* SMTP authentication succeeded, server responded "235 2.7.0 Authentication successful"
* Send succeeded: "250 ok"
This is ok when you're writing the code initially, but much better is a single line (or single structured message):
* Outgoing mail message. from="me@example.com" to="you@other.com" subject="test 123" blocklist=passed template=xyz smtp_server=smtp1.example.com smtp_auth="235 2.7.0 Authentication successful" smtp_status="250 ok"
All the context needed (whether succeeded or not), compact, and concurrent requests are a non-issue. I can't say I'm fully on the "structured log" bandwagon just yet (mostly because I haven't found a tool I like enough to give up the simplicity of a text editor), but I can see the appeal for sure.
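A minimal sketch of the "one wide line" idea in Python (the function and field names are my own invention, not from any particular library): accumulate context in a dict as the operation proceeds, then render it once at the end.

```python
def send_mail(msg, smtp_send, blocklist=frozenset()):
    """Do the whole send, returning one wide event instead of many narrow lines."""
    event = {"event": "outgoing_mail", "from": msg["from"],
             "to": msg["to"], "subject": msg["subject"]}
    if msg["from"] in blocklist:
        event["blocklist"] = "blocked"
        return event                       # still one complete, self-describing event
    event["blocklist"] = "passed"
    event["smtp_status"] = smtp_send(msg)  # e.g. "250 ok"
    return event

def format_line(event):
    """Render k=v pairs, quoting values that contain spaces."""
    return " ".join(f'{k}="{v}"' if " " in str(v) else f"{k}={v}"
                    for k, v in event.items())

line = format_line(send_mail(
    {"from": "me@example.com", "to": "you@other.com", "subject": "test 123"},
    smtp_send=lambda m: "250 ok"))
```

Whether it succeeds, gets blocked, or fails, the caller ends up with a single event that carries all the context that would otherwise be spread across eight lines.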
But I think you can do better than structlog, with causal/trace-based logging. I built a Python library called Eliot to do so (https://eliot.readthedocs.io).
You can use tracing in a single application just like you can for full distributed systems. This means each log message is part of a causal chain of actions, and you can see what caused it.
I really like the structured / wide messages, but they have an issue in comparison: if your sending process fails badly between those phases, you won't know that from the single message. You may not even get any message at all.
Sure, those cases should log something / log exception / something should notice the failure. In practice one day you'll find a case where it doesn't happen.
I wonder if there's a sensible pattern for wrapping a segment in a context that accumulates log calls within until it errors or returns, then concats and dumps.
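One possible shape for that pattern, as a sketch of my own (not a standard library API): a context manager that collects entries for one unit of work and writes them as a single block on exit, whether it succeeded or raised.

```python
import sys

class LogSegment:
    """Accumulate log calls for one unit of work; dump them as one block on exit."""
    def __init__(self, name, sink=sys.stderr.write):
        self.name, self.sink, self.entries = name, sink, []

    def log(self, message):
        self.entries.append(message)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc is not None:
            self.entries.append(f"FAILED: {exc!r}")   # the error rides along
        self.sink(f"[{self.name}] " + " | ".join(self.entries))
        return False                                   # don't swallow the exception

out = []
with LogSegment("send-mail", sink=out.append) as seg:
    seg.log("blocklist passed")
    seg.log("smtp auth ok")

try:
    with LogSegment("send-fail", sink=out.append) as seg:
        seg.log("step 1")
        raise ValueError("SMTP connection lost")
except ValueError:
    pass   # the segment still dumped what it had, plus the failure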
Probably, but some critical errors like stack overflow or out of memory could prevent the stack from unwinding gracefully or otherwise prevent the accumulated log from being written. And then you’d have nothing.
Conceptually, if you separate the application generating the log from the one constructing the structure, that's not a problem: the former still streams individual messages with added context information, and the latter (probably a log-viewing application, so it is completely out of the critical path of recording information) uses that context information to provide a structured view.
Alternatively, as others have mentioned, log as usual but include an instance id and then use an external service/process to aggregate log entries with the same id together.
That's actually an example, not an alternative (the instance id is the “context information”, the external service/process is the separate component using the context information to construct the structured view.)
I did this for a desktop app I wrote. The logger kept an in-memory ring buffer, and when there was an unexpected error it would dump all the previous debug-level logs to give context to the error.
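Python's standard library can approximate this with `logging.handlers.MemoryHandler`, which buffers records in memory and flushes them all to a target handler when a record at or above `flushLevel` arrives:

```python
import io
import logging
import logging.handlers

stream = io.StringIO()                      # stands in for the real log file
target = logging.StreamHandler(stream)
target.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

# Hold up to 1000 records in memory; dump them all when an ERROR arrives.
buffered = logging.handlers.MemoryHandler(
    capacity=1000, flushLevel=logging.ERROR, target=target)

log = logging.getLogger("buffer-demo")
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(buffered)

log.debug("loaded config")       # buffered, not yet written
log.debug("opened connection")   # buffered, not yet written
log.error("unexpected EOF")      # triggers flush: all three lines land together
buffered.flush()
```

One caveat: `MemoryHandler` flushes when its capacity fills rather than discarding the oldest records, so a true ring buffer (as described above) needs a small custom handler built on a bounded deque.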
One thing I try to be careful of, especially in your particular example, is to prefix log lines with session IDs. If my app is sending multiple e-mails simultaneously, I want to know which lines correspond to which sessions if one of them went south, because they're all likely to be jumbled up.
This is doubly true for logging multithreaded processes.
I'm now pushing application logging into a Kinesis Stream that gets firehosed to both S3 and Elasticsearch. This includes both request logs and logs of any CRUD operation on my models (including all changed attributes with their old/new values.) There are separate logs for each, but the CRUD events get tagged with the ID of the request that triggered them so it is now trivial to see everything that happened to a given record and which request caused it and which user did it. And since all request logs have the session ID, I can also easily see everything that user did during that session both before and after the event in question. I've been wanting this sort of logging solution since the early 2000s and I now have it and it is even more incredible than I imagined.
Any blog posts or talks that inspired this? How do you handle retention? Any libraries or tools that helped with it?
I can see the use of such logging clearly, but I work at a young startup and implementing that ourselves sounds like infrastructure overkill. Maybe I'm overestimating the work?
Sorry, just saw your reply. I suspect you are overestimating the work. I just added some Rack middleware to handle request logging. Then plugged into the Rails model callbacks to record the CRUD events into a different Kinesis stream. And we're a 3 dev shop and I did this on my own over a few weeks part-time.
As far as retention goes, I just dump it to S3 for permanent storage. Kinesis Firehose makes that dead simple. I also push each event into AWS-hosted Elasticsearch and I could theoretically purge older data, but for our volume of data, I'll likely retain it there for years.
Edit: Also, I generate UUID in the Rack middleware and set a Thread Local Storage variable that I can pull out in the CRUD logging so that I can always see what Rack request caused the CRUD model event.
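That thread-local trick translates directly to most stacks. A sketch of the same idea in Python (names are mine): set the ID once at the edge, and a `logging.Filter` stamps it onto every record created deeper in the call stack.

```python
import io
import logging
import threading
import uuid

_ctx = threading.local()

class RequestIdFilter(logging.Filter):
    """Stamp every record on this thread with the current request's ID."""
    def filter(self, record):
        record.request_id = getattr(_ctx, "request_id", "-")
        return True

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
log = logging.getLogger("crud-demo")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)
log.addFilter(RequestIdFilter())

def handle_request():
    _ctx.request_id = uuid.uuid4().hex   # set once, in the middleware
    log.info("record updated")           # tagged automatically, deep in the stack

handle_request()
```

In async code, `contextvars.ContextVar` would replace `threading.local` so the ID follows the task rather than the thread.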
This is what I've done and it works pretty well. You can usually filter out just the "bad" session IDs pretty easily and recover a more or less chronological explanation of what happened.
The problem I see there is that a serious issue part way through the process may mean nothing gets logged. Also, for a long-running process you won't see the line until it is complete; this could even apply to what should be a short process when something is wrong.
A compromise might be to log each part individually as is more usual, but with IDs appended/prepended to each line to group them (which is useful anyway for analysis without this next step) and later process the raw log to produce one with the structured entries.
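That post-processing step can be trivially simple. A toy sketch (the `id=` prefix format is my own assumption): fold lines sharing an ID into one wide line, preserving first-seen order.

```python
import re
from collections import defaultdict

def fold(lines):
    """Collapse 'id=XX message' lines into one wide line per id."""
    groups, order = defaultdict(list), []
    for line in lines:
        m = re.match(r"id=(\S+)\s+(.*)", line)
        if not m:
            continue                      # un-tagged lines are skipped here
        if m.group(1) not in groups:
            order.append(m.group(1))
        groups[m.group(1)].append(m.group(2))
    return [f"id={i} " + " | ".join(groups[i]) for i in order]

folded = fold(["id=a connecting", "id=b connecting", "id=a sent 250 ok"])
```

The raw per-step log still exists for the crash case, and the folded view gives you the wide-line ergonomics after the fact.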
This gets to one of my... pet peeves? hobbyhorses?..., which is the importance of the APIs you use. The general logging API is something like LogMessage(level, message, params....), assuming some sort of printf-like functionality. A logger object is probably passed around.
This is not a very good API, though, because it tends to afford that sort of logging, on two levels. First, it tends to encourage a lot of little individual lines dumped out, and second, it tends to make it difficult to have any additional context in the log messages. And you very often care about context in log messages. Log messages that look great during dev because you only ever tested one at a time become impossible to understand in production.
You really want an API more like msg = log.GetMessage(), where when "msg" falls out of scope it writes the log line (or when it commits), and you update msg throughout the operation and generate just the one line at the end. You can even make "log" itself something that carries context with it, for the microservice labeling case, and emits "msg"s that already have that context in it.
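In Python terms, the "commit at end of scope" version of that API might look like this (a sketch; `get_message` and the rest are hypothetical names, not a real library):

```python
import json

class Logger:
    """Carries shared context; hands out messages that write once, on commit."""
    def __init__(self, sink, **context):
        self.sink, self.context = sink, context

    def get_message(self):
        return Message(self)

class Message:
    def __init__(self, logger):
        self._logger = logger
        self.fields = dict(logger.context)   # context comes pre-filled

    def set(self, **fields):
        self.fields.update(fields)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # end of scope == commit: exactly one line per operation
        self._logger.sink(json.dumps(self.fields))
        return False

out = []
log = Logger(out.append, service="mailer")   # microservice label rides on the logger
with log.get_message() as msg:
    msg.set(to="you@other.com")
    msg.set(smtp_status="250 ok")
```

The API affords the right behavior: you can't forget to emit the line, and every message already carries the logger's context.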
(Similarly, anytime I start using a cache I always put my own wrapper around it, because the correct interface for a cache layer is not "cache.get(key)", it's "cache.get(key, func() value)"; that is, a get request should be tied to "the code to generate the value". Otherwise, there will eventually be some code path that loads from the cache but doesn't handle the case it's missing, or some other mismanagement brought on by the fact you've detached the generation of the value from the use of the cache. Exceptions as always, but this is a much better default.)
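A minimal version of that cache wrapper, to make the point concrete (my own sketch):

```python
class Cache:
    """A cache whose get is always tied to the code that generates the value."""
    def __init__(self):
        self._store = {}

    def get(self, key, compute):
        # A miss can never be mishandled: the generator travels with the get.
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]

cache = Cache()
calls = []

def load_user():
    calls.append(1)            # count how often we actually compute
    return {"id": 42, "name": "example"}

first = cache.get("user:42", load_user)    # miss: computes and stores
second = cache.get("user:42", load_user)   # hit: generator not called again
```

Every call site is forced to say what a miss means, so the "forgot to handle the miss" code path can't exist.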
Regarding the need for a tool, I can't say enough good things about [lnav](https://lnav.org). It's like a mini-ETL powertool at your fingertips, w/ an embedded SQLite db and a terrific API. As of mid-2016 when I first used it, querying logs was extremely easy, and reasonably fast (w/ up to several million rows). Highest recommendation.
Many times I've created a log buffer (a list of strings) as part of the unit of work, then blasted everything to the actual logger at once, so you have all the context bracketed by a "* BEGIN REQUEST *" type message. This keeps the simplicity of plain text with just enough structure to process if needed.
Missing is probably the most important thing about logging: be vigilant and never log sensitive user information or secret values like password reset tokens. There have been quite a few cases of compromised secondary/tertiary logging infrastructure leading to exploits of primary systems.
"Be vigilant" seems to be the mindset, but I can't help but feel that a system that protects against accidentally logging secrets is what is needed. Given the varied ways people log, and what and when they log, it is likely unsolvable in the general sense.
We have structured logs (serialized objects as json that auto-get default fields like app name, timestamp, and such). These structures can have fields tagged as sensitive, and those values are redacted. This is still human error prone.
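A stripped-down version of that tagging-plus-redaction scheme might look like this (field names and the flat-event assumption are mine):

```python
import json

SENSITIVE = {"password", "reset_token", "ssn"}   # fields tagged as sensitive

def redact(event):
    """Mask tagged fields in a flat structured event before it is serialized."""
    return {k: "[REDACTED]" if k in SENSITIVE else v
            for k, v in event.items()}

line = json.dumps(redact({"event": "login",
                          "user": "me@example.com",
                          "password": "hunter2"}))
```

As the comment says, this is still human error prone: it only protects fields someone remembered to tag, and this flat version doesn't recurse into nested objects or catch secrets interpolated into free-text messages.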
> …a system that protects from accidentally logging secrets…
That ecosystem exists. There are software tools, both on-prem and (less interestingly) cloud/SaaS, that sit in the middle and look for personal and sensitive data. You plug into their API and then decide what to do when such information is detected: ignore, quarantine, redact, alert…
Tangential: With the recent screw-up of Facebook logging user passwords in cleartext in (what I presume to be) this type of logging, I thought (before clicking) that an article named "Application-Level Logging Best Practices" would say something about stripping out passwords etc. when logging anything that involves user inputs. In fact, I thought it was what prompted this post. :)
I would like to stress the importance of buffering and thread safety in your logging functions. A logger with buffering will far outperform just opening a file, writing and closing, which is default for many loggers.
Buffering is great as long as you can rely on the buffer getting flushed to disk in the event that the application crashes. Even more so if you're doing some kind of embedded development (which I realise is not the usual bread-and-butter here) where your system can trip a circuit breaker or be reset by a watchdog timer.
One tactic is actually to keep the log buffer in RAM and then dump it from the core file. You can often keep a RAM-only buffer holding lower-level logs than those you write to disk.
I absolutely love this approach and have used it successfully in the past, but there's a caveat: I have only encountered a handful of people in my career who know or care what a core file is, and even fewer who know how extract data out of one.
If you're building something that's going to primarily have strong systems people working on it, it's a great approach, especially for ephemeral state logs and debugging. If the "general public" is going to be trying to extract logs out of it, it's probably not very ergonomic for them.
Edit: taking it a level deeper, this is used in the Linux kernel during the early stages of booting. When I was doing bring-up on a board, I wasn't getting any console output when trying to boot the kernel. What I could do is let it sit for a bit, reset, and dump the kmsg buffer from u-boot to see what the kernel had been doing. Absolutely beautiful backdoor when trying to debug broken serial output.
There's an article called something like "The logging system you can't have" which talks about using a similar thing. In the case of a panic, the device would restart and find the buffer in memory, allowing you to recover crash data across boots.
From memory, they then pointed all logs at /dev/kmsg rather than /var/log/syslog or an equivalent logger, so that all userspace logs were buffered immediately into memory. I think the system logger then read from that buffer? Wish I could find the article.
It's fairly easy to write a tool to collect the core file once it's dumped, and also fairly easy to write a tool which wraps gdb and prints out the buffer.
...and make sure logging code is safe, e.g., no NullPointerExceptions, OutOfBoundsExceptions, etc. (Dynamic languages make this more of a challenge) I've seen too many systems where customers are affected because a logging routine had failed.
In a massive environment you'll find that logging becomes almost useless and metrics take over. Then if you have an issue, you take a recording of events and replay it in a dev env instead of checking logs.
Metrics are great. In general, I subscribe to "don't log successes" (with the caveats such as log events tied to billing or that get passed as messages to other services).
But what if your problem only occurs at scale? You won't be able to replicate that in your dev env. Error logs and event logs make sense in many (most?) services.
Log synchronously. Either exploit the POSIX propensity of buffering lines, have an explicit synchronization on your IO routine (that doesn't need to stop the business code), or log each thread to a different stream.
Why would logging not help in an async flow? Link log entries with request IDs and consider structured logs. I analyze logs in massively concurrent systems all the time.
* log errors/exceptions at the error level, and nothing else
* make judicious use of trace, info, debug, etc. levels
* properly namespace loggers so that the log level can be set appropriately depending on the environment and current needs
* get the fundamentals right - if you are logging an exception, do you actually pass the exception to the logger? Same goes for things like failed API calls. Don't just log "API call failed". Log the entire response including status code and messages.