Application-Level Logging Best Practices (geshan.com.np)
191 points by geshan on March 27, 2019 | 63 comments



Something I wish this discussed more was development vs production logs. It talks about a unique request identifier across microservices, but this applies within a single service too.

In your development environment, you're usually only doing one request/job at a time, and so logging each step along the way makes it easy to follow and debug. In production, with tens or even thousands of concurrent requests, this quickly becomes basically impossible.

I've become a huge fan of fewer but "wider" lines. Take sending an email. I'm sure most people have seen (or even written) logs like:

* New message from me@example.com to you@other.com

* Subject "test 123"

* Generating message using template xyz

* Checking sender against block lists

* Sender block list check passed

* Connecting to SMTP server smtp1.example.com..

* SMTP authentication succeeded, server responded "235 2.7.0 Authentication successful"

* Send succeeded: "250 ok"

This is ok when you're writing the code initially, but much better is a single line (or single structured message):

* Outgoing mail message. from="me@example.com" to="you@other.com" subject="test 123" blocklist=passed template=xyz smtp_server=smtp1.example.com smtp_auth="235 2.7.0 Authentication successful" smtp_status="250 ok"

All the context needed (whether succeeded or not), compact, and concurrent requests are a non-issue. I can't say I'm fully on the "structured log" bandwagon just yet (mostly because I haven't found a tool I like enough to give up the simplicity of a text editor), but I can see the appeal for sure.
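A minimal sketch of what accumulating one such "wide" line could look like in Python (check_blocklist and smtp_send are hypothetical helpers standing in for the real steps):

  import json
  import logging

  logger = logging.getLogger("mailer")

  def send_mail(from_addr, to_addr, subject):
      # Build up one event dict instead of emitting a log line per step.
      event = {"event": "outgoing_mail", "from": from_addr,
               "to": to_addr, "subject": subject}
      try:
          event["blocklist"] = check_blocklist(from_addr)        # hypothetical helper
          event["smtp_status"] = smtp_send(from_addr, to_addr)   # hypothetical helper
      except Exception as exc:
          event["error"] = repr(exc)
          raise
      finally:
          # Exactly one line per message, success or failure.
          logger.info(json.dumps(event))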


For Python, structlog (https://www.structlog.org/en/stable/index.html) is a good tool for what you want.

But I think you can do better than structlog, with causal/trace-based logging. I built a Python library called Eliot to do so (https://eliot.readthedocs.io).

You can use tracing in a single application just like you can for full distributed systems. This means each log message is part of a causal chain of actions, and you can see what caused it.

Log messages can still be viewed as individual disconnected messages, but you can also get a tree view (see https://eliot.readthedocs.io/en/stable/quickstart.html for the code that resulted in these logs):

    b1cb58cf-2c2f-45c0-92b2-838ac00b20cc
    └── check_links/1 ⇒ started
        ├── timestamp: 2017-10-27 20:42:47.206684
        ├── urls:
        │   ├── 0: http://eliot.readthedocs.io
        │   └── 1: http://nosuchurl
        ├── download/2/1 ⇒ started
        │   ├── timestamp: 2017-10-27 20:42:47.206933
        │   ├── url: http://eliot.readthedocs.io
        │   └── download/2/2 ⇒ succeeded
        │       └── timestamp: 2017-10-27 20:42:47.439203
        ├── download/3/1 ⇒ started
        │   ├── timestamp: 2017-10-27 20:42:47.439412
        │   ├── url: http://nosuchurl
        │   └── download/3/2 ⇒ failed
        │       ├── errno: None
        │       ├── exception: requests.exceptions.ConnectionError
        │       ├── reason: HTTPConnectionPool(host='nosuchurl', port=80): Max retries exceeded with url: / (Caused by NewConnec…
        │       └── timestamp: 2017-10-27 20:42:47.457133
        └── check_links/4 ⇒ failed
            ├── exception: builtins.ValueError
            ├── reason: HTTPConnectionPool(host='nosuchurl', port=80): Max retries exceeded with url: / (Caused by NewConnec…
            └── timestamp: 2017-10-27 20:42:47.457332


I always thought about building something like this. Excellent work! Even serializes to JSON. Really nice.

It would be nice to integrate it with Python's standard logging system similar to how structlog does this.


It can take messages _from_ the standard logging library, and they even show up in the correct place in the tree: https://eliot.readthedocs.io/en/stable/generating/migrating....

This is the preferable approach if you want the semantic model of tracing for all your logs.


That is a fantastic library!


I really like the structured / long messages, but they have a drawback compared to line-by-line logging: if your sending process fails badly between those phases, you won't know that from a single message. You may not even get any message at all.

Sure, those cases should log something / log exception / something should notice the failure. In practice one day you'll find a case where it doesn't happen.


I wonder if there's a sensible pattern for wrapping a segment in a context that accumulates log calls within until it errors or returns, then concats and dumps.


Probably, but some critical errors like stack overflow or out of memory could prevent the stack from unwinding gracefully or otherwise prevent the accumulated log from being written. And then you’d have nothing.


Conceptually, that's not a problem if you separate the application generating the log from the component constructing the structure: the former keeps streaming individual messages with added context information, while the latter (probably a log-viewing application, so completely out of the critical path of recording information) uses that context to provide a structured view.


Alternatively, as others have mentioned, log as usual but include an instance id and then use an external service/process to aggregate log entries with the same id together.


That's actually an example, not an alternative (the instance id is the “context information”, the external service/process is the separate component using the context information to construct the structured view.)


Ah, good point.


This is a popular concept. See https://www.npmjs.com/package/bunyan#raw--ringbuffer-stream for example. Many languages have an implementation for something like that.


I did this for a desktop app I wrote. The logger kept an in-memory ring buffer, and when there was an unexpected error, it would dump all the previous debug-level logs to give context to the error.
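Python's standard library can approximate this with logging.handlers.MemoryHandler: not a true ring buffer (it flushes when full rather than overwriting the oldest entries), but it gives the same "dump the buffered context when an error shows up" effect:

  import logging
  from logging.handlers import MemoryHandler

  target = logging.StreamHandler()              # where records go when flushed
  buffered = MemoryHandler(capacity=200,        # hold up to 200 records in memory
                           flushLevel=logging.ERROR,  # dump everything on ERROR or above
                           target=target)

  log = logging.getLogger("app")
  log.setLevel(logging.DEBUG)
  log.addHandler(buffered)

  log.debug("step 1 ok")            # stays in the buffer
  log.debug("step 2 ok")            # stays in the buffer
  log.error("unexpected failure")   # flushes the buffered debug lines plus this one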


I suppose you could concat strings and do it in a finally clause? Or use Python's context managers and do the logging in the __exit__().
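A minimal sketch of the context-manager version (the class name and format are made up; the point is that __exit__ emits one block per unit of work, at ERROR if an exception escaped):

  import logging

  logger = logging.getLogger("app")

  class LogScope:
      """Collects lines for one unit of work and emits them as a single block on exit."""
      def __init__(self, name):
          self.name = name
          self.lines = []

      def add(self, msg):
          self.lines.append(msg)

      def __enter__(self):
          return self

      def __exit__(self, exc_type, exc, tb):
          level = logging.ERROR if exc_type else logging.INFO
          logger.log(level, "%s\n  %s", self.name, "\n  ".join(self.lines))
          return False  # never swallow the exception

  with LogScope("send_mail") as scope:
      scope.add("blocklist check passed")
      scope.add("connected to smtp1.example.com")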


Exactly my thinking. And the context wraps the view call and therefore the whole transaction.


One thing I try to be careful about, especially in your particular example, is prefixing log lines with session IDs. If my app is sending multiple e-mails simultaneously, I want to know which lines correspond to which sessions if one of them went south, because they're all likely to be jumbled up.

This is doubly true for logging multithreaded processes.
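In Python this can be as simple as one LoggerAdapter per session plus a %(session_id)s in the formatter (a sketch; the names are illustrative):

  import logging

  logging.basicConfig(format="%(asctime)s [%(session_id)s] %(message)s",
                      level=logging.INFO)
  base = logging.getLogger("mailer")

  def handle_session(session_id):
      # Every line emitted through this adapter carries the session id,
      # so concurrent sessions can be untangled afterwards.
      log = logging.LoggerAdapter(base, {"session_id": session_id})
      log.info("Connecting to SMTP server smtp1.example.com")
      log.info("Send succeeded")

  handle_session("sess-7f3a")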


I'm now pushing application logging into a Kinesis Stream that gets firehosed to both S3 and Elastic Search. This includes both request logs and logs of any CRUD operation on my models (including all changed attributes with their old/new values.) There are separate logs for each, but the CRUD events get tagged with the ID of the request that triggered them so it is now trivial to see everything that happened to a given record and which request caused it and which user did it. And since all request logs have the session ID, I can also easily see everything that user did during that session both before and after the event in question. I've been wanting this sort of logging solution since the early 2000s and I now have it and it is even more incredible than I imagined.


Any blog posts or talks that inspired this? How do you handle retention? Any libraries or tools that helped with it?

I can see the use of such logging clearly, but I work at a young startup and implementing that ourselves sounds like infrastructure overkill. Maybe I'm overestimating the work?

Thanks for any thoughts!


Sorry, just saw your reply. I suspect you are overestimating the work. I just added some Rack middleware to handle request logging. Then plugged into the Rails model callbacks to record the CRUD events into a different Kinesis stream. And we're a 3 dev shop and I did this on my own over a few weeks part-time.

As far as retention goes, I just dump it to S3 for permanent storage. Kinesis Firehose makes that dead simple. I also push each event into AWS-hosted Elasticsearch and I could theoretically purge older data, but for our volume of data, I'll likely retain it there for years.

Edit: Also, I generate a UUID in the Rack middleware and set a thread-local storage variable that I can pull out in the CRUD logging, so that I can always see which Rack request caused the CRUD model event.
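A rough Python analogue of the same trick, with contextvars standing in for the thread-local (just a sketch of the pattern, not the Rails setup described above):

  import contextvars
  import logging
  import uuid

  request_id = contextvars.ContextVar("request_id", default="-")

  class RequestIdFilter(logging.Filter):
      def filter(self, record):
          record.request_id = request_id.get()
          return True

  def middleware(app):
      """WSGI-style wrapper: tag everything logged while handling the request."""
      def wrapper(environ, start_response):
          request_id.set(str(uuid.uuid4()))
          return app(environ, start_response)
      return wrapper

  handler = logging.StreamHandler()
  handler.setFormatter(logging.Formatter("%(asctime)s req=%(request_id)s %(message)s"))
  handler.addFilter(RequestIdFilter())
  logging.getLogger().addHandler(handler)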


Sounds like I did way overestimate. Thank you for the structure!!


This is what I've done and it works pretty well. You can usually filter out just the "bad" session IDs pretty easily and recover a more or less chronological explanation of what happened.


The problem I see there is that a serious issue part way through the process may mean nothing gets logged. Also, for a long-running process you won't see the line until it is complete - this could even apply to what should be a short process when something is wrong.

A compromise might be to log each part individually as is more usual, but with IDs appended/prepended to each line to group them (which is useful anyway for analysis without this next step) and later process the raw log to produce one with the structured entries.


This gets to one of my... pet peeves? hobbyhorses?..., which is the importance of the APIs you use. The general logging API is something like LogMessage(level, message, params....), assuming some sort of printf-like functionality. A logger object is probably passed around.

This is not a very good API, though, because it tends to afford that sort of logging, on two levels. First, it tends to encourage a lot of little individual lines dumped out, and second, it tends to make it difficult to have any additional context in the log messages. And you very often care about context in log messages. Log messages that look great during dev because you only ever tested one at a time become impossible to understand in production.

You really want an API more like msg = log.GetMessage(), where when "msg" falls out of scope it writes the log line (or when it commits), and you update msg throughout the operation and generate just the one line at the end. You can even make "log" itself something that carries context with it, for the microservice labeling case, and emits "msg"s that already have that context in it.
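A small sketch of that shape in Python (method names are made up; the idea is that nothing hits the log until the message is committed, and the logger can pre-fill context):

  import json
  import logging

  class Message:
      """Accumulates fields during an operation; writes exactly one line when committed."""
      def __init__(self, logger, **fields):
          self.logger = logger
          self.fields = dict(fields)

      def set(self, **fields):
          self.fields.update(fields)

      def commit(self):
          self.logger.info(json.dumps(self.fields))

  class ContextLogger:
      """Carries ambient context (service name, request id, ...) into every message."""
      def __init__(self, logger, **context):
          self.logger = logger
          self.context = context

      def message(self, **fields):
          return Message(self.logger, **self.context, **fields)

  log = ContextLogger(logging.getLogger("mail"), service="mailer")
  msg = log.message(event="outgoing_mail")
  msg.set(to="you@other.com", smtp_status="250 ok")
  msg.commit()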

(Similarly, anytime I start using a cache I always put my own wrapper around it, because the correct interface for a cache layer is not "cache.get(key)", it's "cache.get(key, func() value)"; that is, a get request should be tied to "the code to generate the value". Otherwise, there will eventually be some code path that loads from the cache but doesn't handle the case it's missing, or some other mismanagement brought on by the fact you've detached the generation of the value from the use of the cache. Exceptions as always, but this is a much better default.)


Regarding the need for a tool, I can't say enough good things about [lnav](https://lnav.org). It's like a mini-ETL powertool at your fingertips, w/ an embedded SQLite db and a terrific API. As of mid-2016 when I first used it, querying logs was extremely easy, and reasonably fast (w/ up to several million rows). Highest recommendation.


Many times I've created a log buffer (a list of strings) that gets built up as part of the unit of work and then blasts everything to the actual logger at once, so you have all the context surrounded by a "* BEGIN REQUEST *" type message. This keeps the simplicity of plain text with just enough structure to process if needed.


Many companies use structured event logs for data analysis, but still use traditional logs for debugging.


I found it a good practice to create a custom thread factory that carries the request identifier over so it can be appended to the log.

It's actually really easy to filter out everything connected to that identifier after that.

Though it may be challenging to debug what was happening in between.


Just dump your logs to Elasticsearch and search with that stack; trying to eyeball them quickly gets unwieldy even in dev environments.


Another best practice: Please do not use logging as an excuse to avoid implementing proper error propagation in libraries or utility code.

Bad:

  void something_failable(void *data) {
    if (!is_valid(data)) {
      log_err("invalid data");  /* swallowed: the caller never learns about the failure */
    }

    /* ... */
  }
Less bad:

  int something_failable(xxx_t *ctx, void *data) {
    if (!is_valid(data)) {
      return xxx_err(ctx, "invalid data"); 
    }
    
    /* ... */
  }


Another best practice: don't write logic to deal with BUGS, but instead assert.

Bad:

  int something_failable(xxx_t *ctx, void *data) {
    if (!is_valid(data)) {
      return xxx_err(ctx, "invalid data"); 
    }

    /* ... */
  }
Less bad:

  int something_failable(xxx_t *ctx, void *data) {
    ASSERT(is_valid(data));

    /* ... */
  }


Maybe "is_valid" is checking the validity of externally-provided/third party data, that is, it being invalid is a problem, but its not a bug.


If it's failable, it's not a bug.


A fair point, but where is the evidence it applies here?


What's missing is probably the most important thing about logging: be vigilant and never log sensitive user information or secret values like password reset tokens. There have been quite a few cases of compromised secondary/tertiary logging infrastructure leading to exploits of primary systems.


"be vigilant" seems to be the mind set, but I can't help but feel a system that protects from accidentally logging secrets is what is needed. Due to the varied way people log and what they log and when, it is likely unsolvable in the general sense.

We have structured logs (serialized objects as json that auto-get default fields like app name, timestamp, and such). These structures can have fields tagged as sensitive, and those values are redacted. This is still human error prone.
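The tagging/redaction step itself can be tiny; the hard part is keeping the list of sensitive fields complete. A sketch (field names are illustrative):

  import json

  # Field names that must never reach the log stream; keeping this set
  # complete is the human-error-prone part.
  SENSITIVE_FIELDS = {"password", "reset_token", "ssn"}

  def redact(event):
      return {k: ("<redacted>" if k in SENSITIVE_FIELDS else v)
              for k, v in event.items()}

  def emit(event):
      print(json.dumps(redact(event)))

  emit({"event": "login", "user": "alice", "password": "hunter2"})
  # -> {"event": "login", "user": "alice", "password": "<redacted>"}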


> …a system that protects from accidentally logging secrets…

That ecosystem exists. There are SW tools, both on-prem and (less interestingly) cloud/SaaS, that sit in the middle and look for personal and sensitive data. You plug into their API, and then get to decide what to do when such information is detected: ignore, quarantine, redact, alert…

(Disclaimer: I built one such tool, https://pii-tools.com)


I'll look into this, thanks! I'm not sure how you will know hunter2 is a password, but interested to find out :)


Yes “being vigilant” is not even actually sufficient.


Totally agree. Sentry has password scrubbing functionality, BTW.


Tangential: With the recent screw-up of Facebook logging user passwords in cleartext in (what I presume to be) this type of logging, I thought (before clicking) that an article named "Application-Level Logging Best Practices" would say something about stripping out passwords etc. when logging anything that involves user inputs. In fact, I thought it was what prompted this post. :)


In case you are still interested in reading about the security aspect of logging, there is the OWASP Logging Cheat Sheet dedicated to this:

https://github.com/OWASP/CheatSheetSeries/blob/master/cheats...


Thx for sharing, interesting.


I would like to stress the importance of buffering and thread safety in your logging functions. A logger with buffering will far outperform just opening a file, writing, and closing, which is the default for many loggers.

http://fredpuls.com/site/softwaredevelopment/java/log4j/log4...


Buffering is great as long as you can rely on the buffer getting flushed to disk in the event that the application crashes. Even more so if you're doing some kind of embedded development (which I realise is not the usual bread-and-butter here) where your system can trip a circuit breaker or be reset by a watchdog timer.


One tactic is actually to keep the log buffer in RAM, and then you can dump it from the core file. You can often have a RAM-only buffer that holds lower-level logs than what you log to disk.


I absolutely love this approach and have used it successfully in the past, but there's a caveat: I have only encountered a handful of people in my career who know or care what a core file is, and even fewer who know how to extract data out of one.

If you're building something that's going to primarily have strong systems people working on it, it's a great approach, especially for ephemeral state logs and debugging. If the "general public" is going to be trying to extract logs out of it, it's probably not very ergonomic for them.

Edit: taking it a level deeper, this is used in the Linux kernel during the early stages of booting. When I was doing bring-up on a board, I wasn't getting any console output when trying to boot the kernel. What I could do was let it sit for a bit, reset, and dump the kmsg buffer from U-Boot to see what the kernel had been doing. Absolutely beautiful backdoor when trying to debug broken serial output.


There's an article called something like "The logging system you can't have" which talks about using a similar thing. In the case of a panic, the device would restart and find the buffer in memory, allowing you to recover crash data across boots.

From memory, they then pointed all logs at /dev/kmsg rather than /var/log/syslog or an equivalent logger, so that all userspace logs were buffered immediately into memory. I think the system logger then read from the buffer? Wish I could find the article.


It's fairly easy to write a tool to collect the core file once it's dumped, and also fairly easy to write a tool which wraps gdb and prints out the buffer.


A few loggers support level-specific configuration for this; for example, you can trigger a flush whenever an ERROR or above log is made.


...and make sure logging code is safe, e.g., no NullPointerExceptions, OutOfBoundsExceptions, etc. (Dynamic languages make this more of a challenge) I've seen too many systems where customers are affected because a logging routine had failed.


Being able to change log levels on the fly with specific granularity is a game changer imo: https://www.prefab.cloud/documentation/logging

Otherwise your logs are always either too verbose or not there when you need them.
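Even without a hosted service, the cheap version in Python is just flipping a logger namespace's level at runtime, e.g. from a signal handler or an admin endpoint (a sketch):

  import logging
  import signal

  def enable_debug(signum, frame):
      # Turn on DEBUG for one namespace only, without restarting the process.
      logging.getLogger("app.mailer").setLevel(logging.DEBUG)

  def disable_debug(signum, frame):
      logging.getLogger("app.mailer").setLevel(logging.INFO)

  signal.signal(signal.SIGUSR1, enable_debug)
  signal.signal(signal.SIGUSR2, disable_debug)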


* No logging.

In a massive environment you would find that logging is almost needless and metrics take over. Then, if you have an issue, you take a recording of events and replay it in a dev environment instead of checking logs.


Metrics are great. In general, I subscribe to "don't log successes" (with caveats such as events tied to billing or ones that get passed as messages to other services).

But what if your problem only occurs at scale? You won't be able to replicate that in your dev env. Error logs and event logs make sense in many (most?) services.


The deal with good logging is that well-structured code makes debugging with logging easier.

The unsaid rule: if you're doing async and suddenly printing a bunch of messages, logging will not help you. :)


Log synchronously. Either exploit the POSIX propensity for line buffering, add explicit synchronization to your IO routine (which doesn't need to block the business code), or log each thread to a different stream.


Why would logging not help in an async flow? Link log entries with request IDs and consider structured logs. I analyze logs in massively concurrent systems all the time.


site seems down, but off the top of my head

log rotation

log aggregation

log errors/exceptions at the error level, and nothing else

make judicious use of trace, info, debug, etc. levels

properly namespace loggers so that the log level can be set appropriately depending on the environment and current needs

get the fundamentals right - if you are logging an exception, do you actually pass the exception to the logger? Same goes for things like failed API calls. Don't just log "API call failed". Log the entire response including status code and messages.
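For that last point, in Python it's the difference between log.error("API call failed") and something like this (URL and payload are made up):

  import logging
  import requests

  log = logging.getLogger("api")

  def create_order(payload):
      resp = None
      try:
          resp = requests.post("https://api.example.com/orders", json=payload)
          resp.raise_for_status()
      except requests.RequestException:
          # logger.exception records the full traceback; include what the
          # server actually returned, not just "API call failed".
          log.exception("order API call failed: status=%s body=%s",
                        getattr(resp, "status_code", None),
                        getattr(resp, "text", None))
          raise
      return resp.json()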


It's missing a part about high-frequency logging.

Something like "log level checks should take only a single clock cycle".

This looks more like best practices for dev environments or low-traffic apps.


Would you be able to give (or link to) information about best practices for high-traffic apps?



I was talking from experience. I don't really follow any guidelines available online.


Don't forget to sanitize those API requests, or else you'll be like Facebook and end up logging passwords :)



