Something I wish this discussed more was development vs production logs. It talks about a unique request identifier across microservices, but this applies within a single service too.
In your development environment, you're usually only doing one request/job at a time, and so logging each step along the way makes it easy to follow and debug. In production, with tens or even thousands of concurrent requests, this quickly becomes basically impossible.
I've become a huge fan of fewer but "wider" lines. Take sending an email.. I'm sure most people have seen (or even written) logs like:
* New message from me@example.com to you@other.com
* Subject "test 123"
* Generating message using template xyz
* Checking sender against block lists
* Sender block list check passed
* Connecting to SMTP server smtp1.example.com..
* SMTP authentication succeeded, server responded "235 2.7.0 Authentication successful"
* Send succeeded: "250 ok"
This is ok when you're writing the code initially, but much better is a single line (or single structured message):
* Outgoing mail message. from="me@example.com" to="you@other.com" subject="test 123" blocklist=passed template=xyz smtp_server=smtp1.example.com smtp_auth="235 2.7.0 Authentication successful" smtp_status="250 ok"
All the context needed (whether succeeded or not), compact, and concurrent requests are a non-issue. I can't say I'm fully on the "structured log" bandwagon just yet (mostly because I haven't found a tool I like enough to give up the simplicity of a text editor), but I can see the appeal for sure.
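A minimal sketch of the "one wide line" idea in Python (the function and field names are my own invention, not from any particular library): accumulate context in a dict as the operation proceeds, then render it once at the end.

```python
def send_mail(msg, smtp_send, blocklist=frozenset()):
    """Do the whole send, returning one wide event instead of many narrow lines."""
    event = {"event": "outgoing_mail", "from": msg["from"],
             "to": msg["to"], "subject": msg["subject"]}
    if msg["from"] in blocklist:
        event["blocklist"] = "blocked"
        return event                       # still one complete, self-describing event
    event["blocklist"] = "passed"
    event["smtp_status"] = smtp_send(msg)  # e.g. "250 ok"
    return event

def format_line(event):
    """Render k=v pairs, quoting values that contain spaces."""
    return " ".join(f'{k}="{v}"' if " " in str(v) else f"{k}={v}"
                    for k, v in event.items())

line = format_line(send_mail(
    {"from": "me@example.com", "to": "you@other.com", "subject": "test 123"},
    smtp_send=lambda m: "250 ok"))
```

Whether it succeeds, gets blocked, or fails, the caller ends up with a single event that carries all the context that would otherwise be spread across eight lines.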
But I think you can do better than structlog, with causal/trace-based logging. I built a Python library called Eliot to do so (https://eliot.readthedocs.io).
You can use tracing in a single application just like you can for full distributed systems. This means each log message is part of a causal chain of actions, and you can see what caused it.
I really like the structured / wide messages, but they have an issue in comparison: if your sending process fails badly between those phases, you won't know that from the single message. You may not even get any message at all.
Sure, those cases should log something / log exception / something should notice the failure. In practice one day you'll find a case where it doesn't happen.
I wonder if there's a sensible pattern for wrapping a segment in a context that accumulates log calls within until it errors or returns, then concats and dumps.
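One possible shape for that pattern, as a sketch of my own (not a standard library API): a context manager that collects entries for one unit of work and writes them as a single block on exit, whether it succeeded or raised.

```python
import sys

class LogSegment:
    """Accumulate log calls for one unit of work; dump them as one block on exit."""
    def __init__(self, name, sink=sys.stderr.write):
        self.name, self.sink, self.entries = name, sink, []

    def log(self, message):
        self.entries.append(message)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc is not None:
            self.entries.append(f"FAILED: {exc!r}")   # the error rides along
        self.sink(f"[{self.name}] " + " | ".join(self.entries))
        return False                                   # don't swallow the exception

out = []
with LogSegment("send-mail", sink=out.append) as seg:
    seg.log("blocklist passed")
    seg.log("smtp auth ok")

try:
    with LogSegment("send-fail", sink=out.append) as seg:
        seg.log("step 1")
        raise ValueError("SMTP connection lost")
except ValueError:
    pass   # the segment still dumped what it had, plus the failure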
Probably, but some critical errors like stack overflow or out of memory could prevent the stack from unwinding gracefully or otherwise prevent the accumulated log from being written. And then you’d have nothing.
Conceptually, if you separate the application generating the log from the one constructing the structure, that's not a problem: the former still streams individual messages with added context information, and the latter (probably a log-viewing application, so it is completely out of the critical path of recording information) uses that context information to provide a structured view.
Alternatively, as others have mentioned, log as usual but include an instance id and then use an external service/process to aggregate log entries with the same id together.
That's actually an example, not an alternative (the instance id is the “context information”, the external service/process is the separate component using the context information to construct the structured view.)
I did this for a desktop app I wrote. The logger kept an in-memory ring buffer, and when there was an unexpected error it would dump all the previous debug-level logs to give context to the error.
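Python's standard library can approximate this with `logging.handlers.MemoryHandler`, which buffers records in memory and flushes them all to a target handler when a record at or above `flushLevel` arrives:

```python
import io
import logging
import logging.handlers

stream = io.StringIO()                      # stands in for the real log file
target = logging.StreamHandler(stream)
target.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

# Hold up to 1000 records in memory; dump them all when an ERROR arrives.
buffered = logging.handlers.MemoryHandler(
    capacity=1000, flushLevel=logging.ERROR, target=target)

log = logging.getLogger("buffer-demo")
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(buffered)

log.debug("loaded config")       # buffered, not yet written
log.debug("opened connection")   # buffered, not yet written
log.error("unexpected EOF")      # triggers flush: all three lines land together
buffered.flush()
```

One caveat: `MemoryHandler` flushes when its capacity fills rather than discarding the oldest records, so a true ring buffer (as described above) needs a small custom handler built on a bounded deque.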
One thing I try to be careful of, especially in your particular example, is to prefix log lines with session IDs. If my app is sending multiple e-mails simultaneously, I want to know which lines correspond to which sessions if one of them went south, because they're all likely to be jumbled up.
This is doubly true for logging multithreaded processes.
I'm now pushing application logging into a Kinesis Stream that gets firehosed to both S3 and Elasticsearch. This includes both request logs and logs of any CRUD operation on my models (including all changed attributes with their old/new values.) There are separate logs for each, but the CRUD events get tagged with the ID of the request that triggered them so it is now trivial to see everything that happened to a given record and which request caused it and which user did it. And since all request logs have the session ID, I can also easily see everything that user did during that session both before and after the event in question. I've been wanting this sort of logging solution since the early 2000s and I now have it and it is even more incredible than I imagined.
Any blog posts or talks that inspired this? How do you handle retention? Any libraries or tools that helped with it?
I can see the use of such logging clearly, but I work at a young startup and implementing that ourselves sounds like infrastructure overkill. Maybe I'm overestimating the work?
Sorry, just saw your reply. I suspect you are overestimating the work. I just added some Rack middleware to handle request logging. Then plugged into the Rails model callbacks to record the CRUD events into a different Kinesis stream. And we're a 3 dev shop and I did this on my own over a few weeks part-time.
As far as retention goes, I just dump it to S3 for permanent storage. Kinesis Firehose makes that dead simple. I also push each event into AWS-hosted Elasticsearch and I could theoretically purge older data, but for our volume of data, I'll likely retain it there for years.
Edit: Also, I generate UUID in the Rack middleware and set a Thread Local Storage variable that I can pull out in the CRUD logging so that I can always see what Rack request caused the CRUD model event.
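That thread-local trick translates directly to most stacks. A sketch of the same idea in Python (names are mine): set the ID once at the edge, and a `logging.Filter` stamps it onto every record created deeper in the call stack.

```python
import io
import logging
import threading
import uuid

_ctx = threading.local()

class RequestIdFilter(logging.Filter):
    """Stamp every record on this thread with the current request's ID."""
    def filter(self, record):
        record.request_id = getattr(_ctx, "request_id", "-")
        return True

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
log = logging.getLogger("crud-demo")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)
log.addFilter(RequestIdFilter())

def handle_request():
    _ctx.request_id = uuid.uuid4().hex   # set once, in the middleware
    log.info("record updated")           # tagged automatically, deep in the stack

handle_request()
```

In async code, `contextvars.ContextVar` would replace `threading.local` so the ID follows the task rather than the thread.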
This is what I've done and it works pretty well. You can usually filter out just the "bad" session IDs pretty easily and recover a more or less chronological explanation of what happened.
The problem I see there is that a serious issue part way through the process may mean nothing gets logged. Also, for a long-running process you won't see the line until it is complete; this could even apply to what should be a short process when something is wrong.
A compromise might be to log each part individually as is more usual, but with IDs appended/prepended to each line to group them (which is useful anyway for analysis without this next step) and later process the raw log to produce one with the structured entries.
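That post-processing step can be trivially simple. A toy sketch (the `id=` prefix format is my own assumption): fold lines sharing an ID into one wide line, preserving first-seen order.

```python
import re
from collections import defaultdict

def fold(lines):
    """Collapse 'id=XX message' lines into one wide line per id."""
    groups, order = defaultdict(list), []
    for line in lines:
        m = re.match(r"id=(\S+)\s+(.*)", line)
        if not m:
            continue                      # un-tagged lines are skipped here
        if m.group(1) not in groups:
            order.append(m.group(1))
        groups[m.group(1)].append(m.group(2))
    return [f"id={i} " + " | ".join(groups[i]) for i in order]

folded = fold(["id=a connecting", "id=b connecting", "id=a sent 250 ok"])
```

The raw per-step log still exists for the crash case, and the folded view gives you the wide-line ergonomics after the fact.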
This gets to one of my... pet peeves? hobbyhorses?..., which is the importance of the APIs you use. The general logging API is something like LogMessage(level, message, params....), assuming some sort of printf-like functionality. A logger object is probably passed around.
This is not a very good API, though, because it tends to afford that sort of logging, on two levels. First, it tends to encourage a lot of little individual lines dumped out, and second, it tends to make it difficult to have any additional context in the log messages. And you very often care about context in log messages. Log messages that look great during dev because you only ever tested one at a time become impossible to understand in production.
You really want an API more like msg = log.GetMessage(), where when "msg" falls out of scope it writes the log line (or when it commits), and you update msg throughout the operation and generate just the one line at the end. You can even make "log" itself something that carries context with it, for the microservice labeling case, and emits "msg"s that already have that context in it.
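In Python terms, the "commit at end of scope" version of that API might look like this (a sketch; `get_message` and the rest are hypothetical names, not a real library):

```python
import json

class Logger:
    """Carries shared context; hands out messages that write once, on commit."""
    def __init__(self, sink, **context):
        self.sink, self.context = sink, context

    def get_message(self):
        return Message(self)

class Message:
    def __init__(self, logger):
        self._logger = logger
        self.fields = dict(logger.context)   # context comes pre-filled

    def set(self, **fields):
        self.fields.update(fields)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # end of scope == commit: exactly one line per operation
        self._logger.sink(json.dumps(self.fields))
        return False

out = []
log = Logger(out.append, service="mailer")   # microservice label rides on the logger
with log.get_message() as msg:
    msg.set(to="you@other.com")
    msg.set(smtp_status="250 ok")
```

The API affords the right behavior: you can't forget to emit the line, and every message already carries the logger's context.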
(Similarly, anytime I start using a cache I always put my own wrapper around it, because the correct interface for a cache layer is not "cache.get(key)", it's "cache.get(key, func() value)"; that is, a get request should be tied to "the code to generate the value". Otherwise, there will eventually be some code path that loads from the cache but doesn't handle the case it's missing, or some other mismanagement brought on by the fact you've detached the generation of the value from the use of the cache. Exceptions as always, but this is a much better default.)
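A minimal version of that cache wrapper, to make the point concrete (my own sketch):

```python
class Cache:
    """A cache whose get is always tied to the code that generates the value."""
    def __init__(self):
        self._store = {}

    def get(self, key, compute):
        # A miss can never be mishandled: the generator travels with the get.
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]

cache = Cache()
calls = []

def load_user():
    calls.append(1)            # count how often we actually compute
    return {"id": 42, "name": "example"}

first = cache.get("user:42", load_user)    # miss: computes and stores
second = cache.get("user:42", load_user)   # hit: generator not called again
```

Every call site is forced to say what a miss means, so the "forgot to handle the miss" code path can't exist.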
Regarding the need for a tool, I can't say enough good things about [lnav](https://lnav.org). It's like a mini-ETL powertool at your fingertips, w/ an embedded SQLite db and a terrific API. As of mid-2016 when I first used it, querying logs was extremely easy, and reasonably fast (w/ up to several million rows). Highest recommendation.
Many times I've created a log buffer (a list of strings) as part of the unit of work, then blasted everything to the actual logger at once, so you have all the context bracketed by a "* BEGIN REQUEST *" type message. This keeps the simplicity of plain text with just enough structure to process if needed.
Missing is probably the most important thing about logging: be vigilant and never log sensitive user information or secret values like password reset tokens. There have been quite a few cases of compromised secondary/tertiary logging infrastructure leading to exploits of primary systems.
"Be vigilant" seems to be the mindset, but I can't help but feel that a system that protects against accidentally logging secrets is what is needed. Given the varied ways people log, and what and when they log, it is likely unsolvable in the general sense.
We have structured logs (serialized objects as json that auto-get default fields like app name, timestamp, and such). These structures can have fields tagged as sensitive, and those values are redacted. This is still human error prone.
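A stripped-down version of that tagging-plus-redaction scheme might look like this (field names and the flat-event assumption are mine):

```python
import json

SENSITIVE = {"password", "reset_token", "ssn"}   # fields tagged as sensitive

def redact(event):
    """Mask tagged fields in a flat structured event before it is serialized."""
    return {k: "[REDACTED]" if k in SENSITIVE else v
            for k, v in event.items()}

line = json.dumps(redact({"event": "login",
                          "user": "me@example.com",
                          "password": "hunter2"}))
```

As the comment says, this is still human error prone: it only protects fields someone remembered to tag, and this flat version doesn't recurse into nested objects or catch secrets interpolated into free-text messages.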
> …a system that protects from accidentally logging secrets…
That ecosystem exists. There are software tools, both on-prem and (less interestingly) cloud/SaaS, that sit in the middle and look for personal and sensitive data. You plug into their API and then decide what to do when such information is detected: ignore, quarantine, redact, alert…
Tangential: With the recent screw-up of Facebook logging user passwords in cleartext in (what I presume to be) this type of logging, I thought (before clicking) that an article named "Application-Level Logging Best Practices" would say something about stripping out passwords etc. when logging anything that involves user inputs. In fact, I thought it was what prompted this post. :)
I would like to stress the importance of buffering and thread safety in your logging functions. A logger with buffering will far outperform just opening a file, writing and closing, which is default for many loggers.
Buffering is great as long as you can rely on the buffer getting flushed to disk in the event that the application crashes. Even more so if you're doing some kind of embedded development (which I realise is not the usual bread-and-butter here) where your system can trip a circuit breaker or be reset by a watchdog timer.
One tactic is actually to keep the log buffer in RAM and then dump it from the core file. You can often keep a RAM-only buffer holding lower-level logs than those you write to disk.
I absolutely love this approach and have used it successfully in the past, but there's a caveat: I have only encountered a handful of people in my career who know or care what a core file is, and even fewer who know how extract data out of one.
If you're building something that's going to primarily have strong systems people working on it, it's a great approach, especially for ephemeral state logs and debugging. If the "general public" is going to be trying to extract logs out of it, it's probably not very ergonomic for them.
Edit: taking it a level deeper, this is used in the Linux kernel during the early stages of booting. When I was doing bring-up on a board, I wasn't getting any console output when trying to boot the kernel. What I could do is let it sit for a bit, reset, and dump the kmsg buffer from u-boot to see what the kernel had been doing. Absolutely beautiful backdoor when trying to debug broken serial output.
There's an article called something like "The logging system you can't have" which talks about using a similar thing. In the case of a panic, the device would restart and find the buffer in memory, allowing you to recover crash data across boots.
From memory, they then pointed all logs at /dev/kmsg rather than /var/log/syslog or an equivalent logger, so that all userspace logs were buffered immediately into memory. I think the system logger then read from that buffer? Wish I could find the article.
It's fairly easy to write a tool to collect the core file once it's dumped, and also fairly easy to write a tool which wraps gdb and prints out the buffer.
...and make sure logging code is safe, e.g., no NullPointerExceptions, OutOfBoundsExceptions, etc. (Dynamic languages make this more of a challenge) I've seen too many systems where customers are affected because a logging routine had failed.
In a massive environment you'll find that logging becomes almost useless and metrics take over. Then if you have an issue, you take a recording of events and replay it in a dev env instead of checking logs.
Metrics are great. In general, I subscribe to "don't log successes" (with the caveats such as log events tied to billing or that get passed as messages to other services).
But what if your problem only occurs at scale? You won't be able to replicate that in your dev env. Error logs and event logs make sense in many (most?) services.
Log synchronously. Either exploit the POSIX propensity of buffering lines, have an explicit synchronization on your IO routine (that doesn't need to stop the business code), or log each thread to a different stream.
Why would logging not help in an async flow? Link log entries with request IDs and consider structured logs. I analyze logs in massively concurrent systems all the time.
* log errors/exceptions at the error level, and nothing else
* make judicious use of trace, info, debug, etc. levels
* properly namespace loggers so that the log level can be set appropriately depending on the environment and current needs
* get the fundamentals right - if you are logging an exception, do you actually pass the exception to the logger? Same goes for things like failed API calls. Don't just log "API call failed". Log the entire response including status code and messages.