My Logging Best Practices (2020) (tuhrig.de)
386 points by isaacfrond on Oct 5, 2021 | 153 comments



One thing I noticed in one of the article's examples is that the log contained an example user's PII:

    INFO  | User registered for newsletter. [user="Thomas", email="thomas@tuhrig.de"]
Ruh-roh! You now have a potential privacy incident ready to occur if an inadvertent log leak ever happens. I think, at least these days, one vital component of any logging API is the ability to flag particular bits of information as PII so that it can be handled differently upon output, storage, or ingestion by other tools. For example, if the user's actual email address is not specifically useful for debugging, do you need to display it in the default visualization of the log?

I think in 2020, with companies' privacy practices getting more and more scrutiny, any logging best practices doc should at least mention treatment of PII.
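For example, a rough Java sketch of the "flag it as PII" idea (the Pii wrapper and its methods are hypothetical, not from any real library), assuming an SLF4J-style parameterized logger:

    // Hypothetical wrapper: PII values are masked by default and only a
    // dedicated, access-controlled sink would call reveal().
    public final class Pii<T> {
        private final T value;

        private Pii(T value) { this.value = value; }

        public static <T> Pii<T> of(T value) { return new Pii<>(value); }

        /** Raw access, intended only for sinks that are allowed to handle PII. */
        public T reveal() { return value; }

        @Override
        public String toString() {
            // What an ordinary log formatter sees by default.
            return "[REDACTED]";
        }
    }

    // log.info("User registered for newsletter. [user={}, email={}]",
    //          userId, Pii.of("thomas@tuhrig.de"));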


Yuuup! I've been looking through the Epik dump [1] and there's all sorts of PII where it shouldn't be. Logs for sure. Also object dumps as text fields. [2] And the most impressive mistake is logging every failed login attempt with the login and password. There are a ton of entries where people put the password in the username field and vice versa, so it's basically a big database of passwords of frequent users.

There are also a ton of tables where they made a backup copy of the table and just left it in the database for years. Ditto for switching to new code but not cleaning up the old data. So even data that would normally be removed due to, say, expiring domains ends up being preserved.

I recommend digging through it to anybody who is sad about the systems you run at work. It's such a fractal set of fuckups that I promise you'll feel better about your legacy systems.

[1] https://ddosecrets.com/wiki/Epik

[2] in PHP serialization format: https://en.wikipedia.org/wiki/PHP_serialization_format


> There are a ton of entries where people put the password in the username field and vice versa, so it's basically a big database of passwords of frequent users.

Or passwords these people use for other services.


Yup! With email addresses right there. I'm glad the haveibeenpwned people sent out a notice. But I'm sure some people had real messes to clean up.


Ah yeah, that's why in Caddy 2, while we do emit full details from HTTP requests in our structured JSON logs, you have the ability to traverse into each log message and replace or delete data within it: https://caddyserver.com/docs/caddyfile/directives/log#filter


I used to work on medical software where we turned off logging completely in production, to mitigate the risk of accidentally logging patient information.

We needed to log that kind of data during development, and just tagging which messages were sensitive would have been too error prone, so we couldn't reliably have any logs once we went live.

We did, however, have a special crash handler that logged the exception type and call stack, but not even the exception message could touch the hard drive.


What I find works best is to avoid logging raw datatypes and instead limit logging to objects/structs with a custom toString. Then just make sure any PII fields are not included in toString, either through some sort of annotation or by writing the implementation manually. This also encourages minimizing the code that actually interacts with PII, such as replacing emails with user IDs ASAP.
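A minimal hand-written sketch of that approach (hypothetical class, no annotations):

    public final class RegisteredUser {
        private final long userId;
        private final String email;   // PII: deliberately excluded from toString

        public RegisteredUser(long userId, String email) {
            this.userId = userId;
            this.email = email;
        }

        public String email() { return email; }

        @Override
        public String toString() {
            // Only the stable, non-PII identifier is exposed to log output.
            return "RegisteredUser[userId=" + userId + "]";
        }
    }

    // log.info("User registered for newsletter. [{}]", user);
    // -> INFO | User registered for newsletter. [RegisteredUser[userId=42]]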


There are cases where logging identifying info is needed, for example if you rate limit by customer, or if you have dynamic firewall rules that have to read the logs.

In those cases it is permissible to collect the logs and keep them temporarily (24h or so), but a workflow that keeps deleting or scrubbing older logs must be in place.


Agreed.

I found out a few years back that a large industrial device our company made was recording all users' passwords, due to a logging statement that was blindly logging every piece of user input via the UI.


This reminds me of the time a developer turned on debugging to fix some credit card thing and was sending us credit card numbers in their logs.


I'm surprised there's no mention of tagging with some kind of correlation ID. I find that even generating a quick one-off GUID at the beginning of the request (to be used across the entire request) makes tracing server logs substantially easier. I just tag every log (and exception) with that ID, so when searching with grep or Splunk or whatever, I have a quick and easy way to disambiguate which log is connected to other logs.
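Something like this, as a rough sketch (assuming the jakarta.servlet API and SLF4J's MDC, with a log pattern that includes %X{requestId} on every line):

    import jakarta.servlet.*;
    import java.io.IOException;
    import java.util.UUID;
    import org.slf4j.MDC;

    public class RequestIdFilter implements Filter {
        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            // One ID per request; every log line (and logged exception) carries it.
            MDC.put("requestId", UUID.randomUUID().toString());
            try {
                chain.doFilter(req, res);
            } finally {
                MDC.remove("requestId");   // don't leak the ID to the next request on this thread
            }
        }
    }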


And there's a standard for it: https://www.w3.org/TR/trace-context/

Many libraries and tools support the standard now.
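The propagated context is a traceparent HTTP header of the form version-traceid-parentid-flags, for example:

    traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01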


And now I feel embarrassed for not knowing about this...Thanks for the link!


Don't, I only learned about it recently, and the standard itself is quite young!


Just an opinion, but I tend to think of this as distributed monolithic development... the more log correlation is needed, the more likely a requirement has been chopped up and then duct-taped back together over HTTP.


Not sure I follow that logic?

If you have a heavily used service, it's likely to have many concurrent requests logging into the same log stream and you absolutely need a request correlation identifier to be able to thread the log statements for a single request back together.

That's essential even if you don't actually call any other services as part of your implementation.


CorrelationIDs aren’t just good for hard failures. They’re also good for figuring out why your backend started getting 20% more traffic on Monday for no reason. Someone introduced a duplicate call somewhere. But where?


Late reply: This scenario is reasonable and I usually handle it by logging the thread ID in multi-threaded systems, which amounts to a "free" global variable.


What about distributed systems? I had to reach for this tool recently to trace requests as they propagate through up to 5 separate devices. Not everything is a web app.


> Log after, not before

My problem with that is: what if the program crashes or produces an error? It's good to know what the program was attempting to do.

Imagine the output

    Successfully connected to DB.
    Successfully read config file. 
    Uncaught Exception: NullPointerException
vs.

    Will connect to DB...
    Will read config file...
    Will connect to API server...
    Uncaught Exception: NullPointerException


That's what the DEBUG level is for, as explained at the end of the section you're referring to. So it would be like:

    D: Connecting to DB
    D: Reading config file
    D: Connecting to API server
    D: Sending updated record
    I: Updated user record for $USER
Or, if it fails, the last line is replaced by:

    E: Failed to connect to API server; could not update user record. [IP=..., ErrNo=..]


Personally I'm not a fan of DEBUG level (unless you always keep it on). You can have a transient system issue that appeared for 30 minutes and then went away. If DEBUG logs weren't turned on at the time, they won't help me.


Of course you always keep it on — the point of the different levels is that you can filter them out at read time (e.g. to eyeball logs, or to alert when the rate of >= WARN is too high)


This traditionally has not been the use case for log levels.

syslog() and setlogmask() being the obvious examples.

Many (if not all) companies I have worked at filtered out everything DEBUG at compile time to improve performance in production as well.

The best option I've seen is putting everything in DEBUG or lower into a memory ring buffer per process that's dumped and collated with other logs at each ERROR/FATAL log.
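A rough, framework-agnostic sketch of that idea in Java (per-thread rather than per-process for simplicity; all names are hypothetical):

    import java.util.ArrayDeque;

    final class RingBufferLog {
        private static final int CAPACITY = 256;
        private static final ThreadLocal<ArrayDeque<String>> BUFFER =
                ThreadLocal.withInitial(ArrayDeque::new);

        static void debug(String msg) {
            ArrayDeque<String> buf = BUFFER.get();
            if (buf.size() == CAPACITY) {
                buf.removeFirst();              // drop the oldest entry
            }
            buf.addLast("D: " + msg);           // no I/O on the happy path
        }

        static void error(String msg) {
            ArrayDeque<String> buf = BUFFER.get();
            buf.forEach(System.err::println);   // dump the buffered context first
            buf.clear();
            System.err.println("E: " + msg);
        }
    }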


That is actually quite nifty. That way you can carry debug logs and only materialise/send them in case of errors, when you are actually interested in them, without having to put those logs at ERROR level and see them all the time.

Does anyone know of an implementation for this in any of the Java logging frameworks?


I've actually used this idea in production and it worked great.

In my web app if nothing unexpected happens, only INFO/LOG level stuff is pushed to logs. If, however, I get an exception, I will instead print out all the possible traces. I.e. I always store all the logs in-memory and choose what should be printed out based on the success/failure of the request.

Now, of course, this is just a web API running in AWS Lambda, and I don't have to care overly much about the machine just silently dying mid-request, so this might not work for some of you, but this works great for my use-case and I'm sure it will be enough for a lot of people out there.


Fwiw Sentry does that out of the box; anything below warning (I think) is used as context for “actual events”.


Performance degradation is noticeable with DEBUG level in production.


Performance degradation is noticeable with excessive logging; the level is irrelevant.


Debug was not designed to always be left on. That's why you need to enable it in the first place.


Leaving debug levels on in production will bankrupt most companies.



Nope, this produces unmanageable log volume.


How does making these DEBUG logs into INFO logs make the volume manageable?

Or, to flip that around, if you take a program that produces a manageable amount of INFO logs, and change some of those INFOs to DEBUGs, how does that suddenly become unmanageable?


My point is that DEBUG level logging is (hopefully!) not on by default, and that this is what makes the production log volume manageable.

My experience has been that 1 customer-facing byte tends to generate something like ~10 DEBUG-level telemetry bytes. That level of amplification can't feasibly be sustained at nontrivial request volumes; your logging infrastructure would dwarf your production infrastructure.


Why not have the program write logs to a temp file somewhere at trace/debug level, and only output the standard level to the console?


Wouldn't that end up slowing down the whole system from writing to a file though?


Ah, I suppose it depends on how much you're doing and/or how verbose your logs are. If you've got lines speeding past then yeah, I maybe wouldn't write them all.


Writing verbose logs is very expensive.


It depends which costs what.

In a case where there are plenty of resources and the thing you are logging is very heavy then even verbose logging is trivial.

If you are beset by failures it is much cheaper than making your developers guess.


The log file should, by default, have all necessary info to debug the issue - or at the very least enough information to narrow down the problem so far that you can replicate it locally.

I've spent way too many hours adjusting log levels on a client environment and trying to replicate the issue without breaking or losing data while doing it.


Debug level is not for production, it's for debugging.


Have you never needed to debug an issue that was only happening on production?


Even though it can be lost by an optimistic configuration, I prefer to use the TRACE level(s) and leave the DEBUG one for, well, debugging.


We're discussing a NullPointerException situation; how is that not debugging?


I think OP's case was not debugging a crash, just that if a crash happens he would like the log to be before rather than after. The log statements would be there no matter whether there is a crash or not, just in case.


I prefer to have both pieces of information (the actual logs would be structured):

    Connecting to DB
    Connected to DB → OK
    Reading config file
    Read config file → OK

Even better, but hardly doable because of the nature of logging

    Connecting to DB ... OK
    Reading config file ... OK


Performance though?


Diagnosis of failure?


I think he is assuming that the logged exception will include the details needed to track down the error (line number, stack trace)


Of course in an ideal scenario that's the case. Nevertheless programs do sometimes just "crash" without an error message, or maybe they hang and never produce any output due to an infinite loop, or excessive memory allocation and GC overhead, etc. Logging what the program was attempting to do is useful in those scenarios.


Yes, it's useful, but that occasional usefulness is substantially outweighed by its perpetual cost.


Yes. I think logging before is still useful, but it is for hangs, not exceptions.

If your program disappears into a black hole doing some operation that doesn't have (but needs) a timeout specified, you want to know what it was about to do when you last heard from it.


I've found the most useful log before/after pattern is a little-known dynamic variant.

Imagine this success output, at a low (not tracing) log level:

  Opened DB connection
Error output at the same logging level:

  Opening DB connection
    Opening TCP connection to 127.0.0.1:5678
      Error: No more file descriptors
    Error: Failed opening TCP connection
  Error: Failed opening DB connection
Crash output at the same logging level:

  Opening DB connection
    Opening TCP connection to 127.0.0.1:5678
    Starting TLS negotiation
      Fatal: Segmentation fault
      <stacktrace...>
Success output with tracing turned up and no error:

  Opening DB connection
    Opened TCP connection to 127.0.0.1:5678
    TLS negotiation completed in 0.1ms
    DB version 5.1 features A, B, C
  Opened DB connection
The key feature here is that there's a scoped before/after message pair ("Opening DB connection" and "Opened DB connection"), and the before part of the pair is output if and only if any messages are output in between.

The contour around messages in between is visible (here with indentation), and the before message is not shown if there are no intermediate messages. Sometimes it's useful to show a different after-message if the before/after pair is not output, to include a small amount of information in the single message, or just to improve the language in the quieter case. Sometimes it's useful to suppress the after-message just like the before-message, if there is nothing in between.

Anything that might terminate the program, such as the Segmentation fault with stacktrace, counts as an intermediate message, and so triggers the output of all the before messages as part of the termination message logging. For this to work, of course you must have some way for the before messages to be queued that is available to the crash handler.

This technique is not well known, which is a shame, as in my opinion it's a very helpful one.

Its chief benefit isn't saving a few messages, it's that it scales to complex, deeply called systems with heavy amounts of tracepoints encouraged throughout the code. When studying issues, you may turn on tracing for some components and actions. As you turn on tracing for a particular component, before-messages from caller components become activated dynamically, ensuring there's always context about what's happening.

In case you're wondering, it works fine with multiple threads, async contexts, and even across network services, if you have the luxury of being able to implement your choice of logging across services, but of course there needs to be some context id on each line to disambiguate.
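In case it helps, a rough Java sketch of the deferred before-message mechanism (illustrative only; a real version also needs the crash-handler integration and context IDs mentioned above, and this one assumes Java 11+ for String.repeat):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Iterator;

    // A scope queues its before-message and only emits it once something
    // actually logs inside the scope (or inside a nested scope).
    final class ScopedLog implements AutoCloseable {
        private static final ThreadLocal<Deque<ScopedLog>> STACK =
                ThreadLocal.withInitial(ArrayDeque::new);

        private final String before;
        private final String after;
        private final int depth;
        private boolean beforeEmitted;

        ScopedLog(String before, String after) {
            this.before = before;
            this.after = after;
            this.depth = STACK.get().size();
            STACK.get().push(this);
        }

        /** Emit a message, flushing pending before-messages of enclosing scopes first. */
        static void log(String msg) {
            Deque<ScopedLog> stack = STACK.get();
            // push() adds to the head, so iterate in reverse for outermost-first order.
            Iterator<ScopedLog> outermostFirst = stack.descendingIterator();
            while (outermostFirst.hasNext()) {
                ScopedLog scope = outermostFirst.next();
                if (!scope.beforeEmitted) {
                    scope.beforeEmitted = true;
                    System.out.println("  ".repeat(scope.depth) + scope.before);
                }
            }
            System.out.println("  ".repeat(stack.size()) + msg);
        }

        @Override
        public void close() {
            STACK.get().pop();
            // Variant described above: the after-message could also be suppressed
            // (or reworded) when nothing was logged inside the scope.
            System.out.println("  ".repeat(depth) + after);
        }
    }

    // try (ScopedLog db = new ScopedLog("Opening DB connection", "Opened DB connection")) {
    //     ScopedLog.log("Error: No more file descriptors");  // triggers the before-message
    // }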


Got any example implementations? "Dynamic logging" isn't a great search phrase


Generally agree but this also leans a bit towards printf-debugging.

Exceptions should log their call stack, making both the before and after logs redundant.

Save the logs for significant events.


Agreed. Logging before gives you a lot of information to diagnose an issue. Logging just exceptions usually lacks a lot of necessary information


Just print out all necessary information, in case of exception, and suggest a potential solution to the problem.


This does not work in practice a lot of the time. What if the problem was actually due to some other thread and said other thread didn't crash? If you only log after success/failure, you would have absolutely no idea why the program crashed. Oftentimes if the actual error were anticipated enough for good reporting, said error would not exist in the first place.


Would it make sense to log a traceback, on unexpected errors?


+ Use structured logging (ex: https://www.structlog.org/en/stable/why.html)


Structured logging is just tracing but less powerful. "Context" and "binds" are just poor-man's spans. If you're going to go through the effort to redo your whole application's logging infrastructure just bite the bullet and use OpenTelemetry.

Edit: Good lord people, set the collector to never sample and use one of the many file exporters and you get completely bog standard logs but with a huge amount of work already done for you and enough context to follow complicated multi-threaded/multi-process programs without making your head spin.


I think there's room for both.

Tracing is usually sampled, or much more time limited, and aggregation tooling isn't nearly as good compared to logs with most providers. Much better for deep dives into a narrow context. For a single request drilldown or aggregated performance/error metrics, tracing every time.

Structured logging tends to give you rich search against broader contexts (e.g. long spans of time, many requests in aggregate, random non performance/error attributes you may care about) without the same concerns of data cardinality.

I felt nowhere near the same power with tracing tooling for random broad context as I've had with e.g. Splunk for logs. That's not to say you can't somehow have the best of both worlds, but I haven't seen it in the ecosystem just yet.


Indeed. For those unfamiliar, take a quick read of https://github.com/open-telemetry/opentelemetry-specificatio...

There are a few concepts to understand, but instrumenting it right from the start is valuable as it grows into a bigger system. Being a cross-cutting concern, the more code that standardizes on this the more useful it will be.


So much this. What’s important to know is that structured logging is as expensive as tracing. If you are forced into the cost of a trace span for every message, you won’t do much logging. Maybe this makes sense to Python programmers to whom a few microseconds here and there don’t seem significant, but to C++ programmers it is pure lunacy. The cost of informational logging with something like glog or spdlog can be as little as a few CPU cycles.


I don't see why basic structured logging would be significantly more expensive or complex than unstructured logging.

In the simplest case you could just output a sequence of values (for example as a csv, or one json array per line), instead of an interpolated string. The size and CPU overhead of this approach is minimal, but makes it much easier and more robust to parse the log afterwards.

Such basic structured logging is quite similar to what the article proposes in "Separate parameters and messages", but more consistent, and unambiguous even if one of the arguments contains your separator character.
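A deliberately naive sketch of what that can look like, here in Java (hypothetical helper, one JSON object per line; the escaping is where the extra cost over dumping raw bytes comes in):

    import java.time.Instant;
    import java.util.Map;

    final class JsonLog {
        static void info(String event, Map<String, Object> fields) {
            StringBuilder sb = new StringBuilder("{\"ts\":\"")
                    .append(Instant.now())
                    .append("\",\"level\":\"INFO\",\"event\":\"")
                    .append(escape(event)).append('"');
            // Values are emitted as strings here purely to keep the sketch short.
            fields.forEach((k, v) -> sb.append(",\"").append(escape(k))
                    .append("\":\"").append(escape(String.valueOf(v))).append('"'));
            System.out.println(sb.append('}'));
        }

        private static String escape(String s) {
            return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
        }
    }

    // JsonLog.info("newsletter_registration", Map.of("userId", 42, "source", "web"));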


Correctly and safely serializing to CSV and JSON is more overhead than spitting out bytes. It's also easier to spit or ingest bytes as a stream—yes, there are tools that understand streams of json objects, but you still have to have all the data in memory at once for a given record, to serialize it, and then you've basically got it in memory twice at the same time (original + serialized text). Your minimum record size to consider becomes "this entire object" rather than "this field". You've also now got the code for CSV or JSON serialization in memory.


Serializing is very fast, and ingesting JSON as a stream is trivial when it's newline delimited. No one emits a giant array of JSON log objects; they emit a stream of newline-delimited JSON.

You almost certainly also have it in memory more than twice since you're copying a buffer out to IO.

If you care about record size use protobuf, compression, backend parsers, etc.


Unless we're talking about something performance critical up front like an embedded system, odds are by the time an organization is going to care about logging performance overhead, they probably already have a few engineers dedicated to just optimizing logging.

Of course some developers may just go ahead and create a JSON/CSV string from scratch, which is about as safe as string interpolation instead of parameter bindings for SQL statements /s. But provided they're using a proper JSON/CSV/proto object builder library that uses String references, no data is going to get copied until a logging component needs to write it to a file or a network buffer. Therefore the only overhead are the pointers that form the structure.


You can json or csv encode data in a streaming manner, you just need to escape some special characters (only " for csv) and add a separator. But I never worked on an application where log records were big enough for streaming to make sense.

If you want to preserve the "one line = one record" property typically expected from logs, standard csv will not be suitable (since it doesn't escape linebreaks), but json without unnecessary linebreaks will handle that (known as json-lines).


Structured logging is just taking whatever information you already intended to log anyway and giving it a consistent structure. Ie, instead of

"$TIMESTAMP $LEVEL: Hey look index $INDEX works!"

you have

$TIMESTAMP $LEVEL $FORMATTER $MSG $INDEX

I'm quite intentionally drawing a parallel to `echo` vs `printf` here, where printf has an interface that communicates intended format while echo does not.

The only overhead of structured logs done right is that you need to be explicit about your intended formatting on some level, but now with the bonus of reliable structure. Whether you use glog or spdlog or whatever is somewhat inconsequential. It's about whether you have an essentially unknown format for your data, having thrown away the context of whatever you wish to convey, or whether you embed the formatting with the string itself so that it can later be parsed and understood by more than just a human.

If you're concerned about the extra volume of including something like "[%H:%M:%S %z] [%n] [%^---%L---%$] [thread %t] %v" on every log entry, then you use eg (in GLOG equivalent for your language) LOG_IF, DLOG, CHECK, or whatever additional macros you need for performance.

If I'm wrong or just misunderstanding, please do correct me.


LOG_IF and VLOG, and related concepts from other libraries, have their places but I don't consider not logging to be a solution to expensive logging. In fact, that was my whole original point.

With structured logging the outputs may be trivial in which case the compiler can do a good job. But if you are inserting a std::string into a log line and your log output format is JSON, there's nothing the compiler can do about the fact that the string must be scanned for metacharacters and possibly escaped, which will be strictly more costly than just copying the string into the output.


Please understand my line of inquiry is sincere and not intended to be read as bad-faith-argumentative. I say this only because tone is so hard to convey.

Anyway.

Isn’t the choice of eg JSON-string vs any other string format somewhat beside the point? Wouldn’t you either 1) need to scan for metacharacters or 2) NOT need to scan for metacharacters, regardless of whether your log is structured or unstructured, at time of output?

Ignoring for a moment the additional cost per character of outputting additional data, but ignoring nothing else in the scenario, wouldn’t something like, for example:

“My index is {index}␟index=5”

cost the same to output as:

“My index is 5”?

It seems to me that the cost of interpreting the formatting doesn’t need to be paid until you wish to read the log message, which presumably you don’t strictly need to do at all until you actually care about the content of the message, or at the very least can defer the cost until a less critical moment, or do some “progressive enhancement” dependent on some set of pre-requisite conditions.


Wrong. Structured logging in C++ done right [0] is the most efficient you can do. Plus you get all mentioned benefits.

[0] https://youtu.be/FyJI4Z6jD4w


Logging is significantly cheaper than tracing to maintain in a usable state.


Maybe it’s just a budget pathology, but we sample our traces. Browsing the available traces can be informative, but you can’t just pull a trace for an arbitrary request. We do keep and index all the logs (for a few days).


> Structured logging is just tracing but less powerful.

That may be the case; however, if what you have now is "logging", I would 100% recommend the incremental upgrade to "structured logging".


Totally agree. One huge advantage not mentioned specifically at that structured logging link is handling privacy. A logging pipeline can be set up to have a collector that automatically redacts privacy-related or sensitive fields. Only give developers access to redacted logs to help prevent leaks.


Who would have access in the first place besides said devs?


The ops people. In many companies access to production is limited.


It's important for data to be limited to those with business need to know, which is why developers (who are responsible for the application) should have access to its data while ops (who are responsible for core infrastructure) must not.

I would not put an email address in logs, but definitely some kind of account identifier so that we can investigate bug reports from specific users.


What you want to prevent is log dumping to some third party, which is a pretty easy exploit in many applications I have touched. Ops always have access to the data; I think that is why we have devops. So maybe you are talking about some other kind of ops.


Yes. This. Been using log15 in Go for a while now and really like it.

https://github.com/inconshreveable/log15


I totally disagree with the “log after”. I have seen it multiple times where the log ended but it wasn’t possible to tell in which code section, because some call had thrown an exception. I much prefer to see a log with lines like “attempting to do x with parameters a,b,c”.

In general I set the bar for “too much logging” very high. It’s a better problem to have too much log to store and sift through vs. having no data.


In that case, the stack trace from the exception should be able to tell you that you failed on the line _before_ the "log after".

It won't be nearly as convenient to dig that fact out, though.


> the exception should be able to tell you that you failed on the line _before_ the "log after".

It is an extension of the «but it worked on my laptop» mentality.

No, it won't work in a complex environment, specifically in a scenario where a server farm sits behind a load balancer with a VIP (geographically distributed or not), which is ubiquitous. Specifically, it won't work in a situation where a server instance the application attempted to connect to sits in a subnet in another data centre across a WAN link, and the networks put a shoddy firewall rule in last night or applied a bit rotten firmware patch to the router, and new server connections suddenly started getting TCP resets halfway through the data exchange. You won't know which particular service instance the resolver ended up resolving and connecting to unless you have a "before" log entry as well. Hiding the dead horse in the cloud will help in a serverless or in a simple scenario, but, if you have a load balancer and multiple availability zones, you still have the same problem, due to cloud load balancers failing over to another AZ at will (it is more nuanced than that, though).

Logging inputs is as critically important as logging the preceding statement's outcome, albeit it has to be used wisely, as excessive and, worse, not properly structured log entries can (and will) overwhelm the log sink in a high data volume environment. Log ring buffers can be a solution for some cases, and a monitoring component integrated into an app can be an answer in some other cases, but there is no single, one-size-fits-all solution for an arbitrary setup in a high data volume environment.


In the apis that I am familiar with, there is typically generic, cross-cutting logging as a filter on all http requests, so you always log the "GET on /customers/123".

This is a given, it's typically done by "add the library, one line of code at startup to enable it"

While this technically qualifies as "Logging inputs before" it does not seem to be what the author is talking about. I was assuming that was present, and had been reached - cause if it had not, what request are you even debugging? Rather than some "«but it worked on my laptop» mentality" straw man.


That assumes that all fail cases will create an exception and properly pass on that a failure occurred, with enough information so that the caller can decide what to do next.


The assumption upfront is that you don't reach the "log after" point. IMHO, given that, not having an exception will be rare, and probably involve code that looks "off" and fixing that is the first task. YMMV though, depending on language


You don’t know the input that caused the exception. Some exceptions give you good detail but a lot don’t.


I would say that since I have spent time ensuring that exceptions are properly detailed, and those details properly logged (1), this is correct, but is not an invariant that makes exceptions useless. It is often something to fix.

1) shout out to https://github.com/RehanSaeed/Serilog.Exceptions


The author does log before, just at DEBUG level. In the scenario you describe, the information you desire would be available.


I've also seen issues with logging after.

It feels more natural to log when something is complete instead of constantly saying when you're starting to do something, but it leads to issues where you have no idea what the system was actually doing when it failed. It's very easy to have the process die without logging what made it die, especially if you're using a language that doesn't provide stack traces or you have them turned off for performance.

Logging before an action also means it's possible to know what's going on when a process hangs. It might lead me to where I should look to resolve the problem instead of forcing me to just restart the service and hope for the best.


Another option is to log after, but in a cross cutting way, e.g. using a middleware or filter. Extra properties can be added along the way, and if there's an exception the middleware can handle it and add the problem to the log event.

The power of structured logging here is in enriching events to make them queryable later, so you could e.g. view all events for a request.


> In general I set the bar for “too much logging” very high.

Which is pretty good practice because when things go wrong in Prod, you want your logs to tell you quickly what went wrong. Too many times I have seen that something goes wrong in Prod and then people say "Oh we should have logged that"


Log after is SOP in any reasonably large architecture. Too much logging is a problem you encounter very quickly with nontrivial traffic.


That’s a good point. However, I suspect that a lot of people are not working on a scale that would qualify as “large”. I am just working on a project where the infrastructure guys planned for massive scaling but so far the highest we ever got was 3 pods and we easily could have handled this with one server. I think I would wait for logging to become a problem before reducing it. If that ever happens.


> Log after, not before

I tried this recently by accident, and the drawback of that approach immediately became apparent:

The logs will be very difficult to read, because for nested function calls, low-level callee functions will log before high-level caller functions, and you have no idea why they are called until later:

    Finished string comparison of length 4
    ... 200 more lines...
    Finished string comparison of length 2
    Finished Boyer-Moore string search
    Finished searching for example.com
    Finished searching for disallowed domains
    Finished checking email address
    Finished logging in user
With the suggested approach, you'll have to read your logs backwards to make sense of them.


There's nothing wrong with reading logs backwards and it is arguably better because (when testing or diagnosing exceptions) finding the end of the log is generally quicker than finding the point midway through the log where your task ran.

There's nothing stopping you from adding "Begin User Login" logging calls if you want to make the nesting more explicit. If you have no idea why stuff is being called, it's only because you have unlogged branches in your code.

And it is way better than reporting success when it hasn't actually happened.


You only emit a single log event (line) per request. Problem solved.


"Best Practices" really is synonymous with "Personal Preferences" if it can work in headlines like this, isn't it?


"my" + "best practices" = "personal preferences"


"my" + "personal preferences" = "best practices"


The author does make some effort to explain why.


The author, however, has some reading to do around structured logging and related practices, which might raise their personal "best".


> Log after, not before

I'd say this is good for production. During development, I sometimes like to log the name of the function I'm in as the first line of the function. That way I'm assured to know if the function was even called. I will also put before and after logs if I want to see how data is changed (Yeah, there are things called debuggers that let you watch variables, but sometimes a log is just so much easier - such as in heavily threaded code or code with a number of timers and events firing).


That's why the author wrote:

> I make exceptions for DEBUG.


Yes, but I think the author could have expanded on that tbh.


Maybe I was expecting too much from this article, but I don't see anything about logging request IDs etc.

Sharing a few things below that I have learned over the past few years:

1. Log every single message for a transaction/request with the requestId, to look at the entire picture. Ideally, that requestId would be used upstream/downstream internally too, for finding the exact point of failure in a sea of microservices.

2. Log all essential IDs/PKs for that particular transaction - makes debugging easier.

3. Log all the messages in JSON format if the log aggregator you are using supports it (example below). Parsing later for dashboards makes lives WAY EASIER and is more readable too. It might also reduce the overall computation being run in the cloud to extract the values you want -> hence, lower cost.

4. Having error/warn logs is good, but having success/positive logs is equally important imo.

5. Oh, and be very very careful about what you end up logging. We don't want to log any sort of user PII. At one of my previous companies, there were mistakes where we were logging users' phone numbers, addresses etc.
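For illustration, a made-up JSON log line of the kind points 1-3 describe (field names are arbitrary):

    {"ts":"2021-10-05T12:34:56Z","level":"INFO","requestId":"4ad2cf9e-0c6f-4c2a-9a3e-7b5d2e1f0a1b","orderId":12345,"msg":"Order confirmed"}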

I am sure the community here already knows about these, and might have even more to add. Would love to hear what other people are doing.

Cheers.


In asynchronous apps it could be better to log both, before and after. You don't know what else might get logged in the meantime.

With info and success level you could log info before and success after execution.

In dev mode you show info, in production you could filter success.


Will Sargent did a nice talk on logging at Scalacon. My biggest takeaway was that pre-choosing a 'log level' for what to skip is making the decision too early (it's hard to guess what messages we'll want), and too coarse-grained (we don't want verbose messages when we follow the happy-path, but tricky/exceptional paths may want more logging).

His suggestion is to accumulate verbose messages in memory/disk, so we can later decide whether to dump them out (e.g. in a top-level error handler).

The talk suggested an SQLite DB, with dynamically-configurable filters for what to keep, etc. but that's a step beyond ;)


> His suggestion is to accumulate verbose messages in memory/disk, so we can later decide whether to dump them out (e.g. in a top-level error handler).

Interesting; I've definitely seen that with unit test frameworks. Where in addition to assertEquals(...) there was an annotate(...) method which just took a string, and was stored for that unit test. If the unit test eventually failed, all the previous annotations for that unit test were printed, otherwise they weren't.

https://metacpan.org/pod/Test::Unit::TestCase#annotate-(MESS...


I really do not think that logs belong in any sort of database. Perhaps there is something I have not thought of?

Structured logs (of course structured) in a simple text file. Rotate the files if your antiquated system cannot handle big files, or if your wildly successful cryptocurrency scam generates many, many gigabytes per time period.

`grep` is your friend!


I like the idea of a simple SQLite DB, but agree it's a bit overkill. In particular, it increases the barrier to writing a log message (what structure/keys/etc. to choose?), which disincentivises logging.

The accumulate-then-dump idea doesn't require any database though; just an in-memory buffer (e.g. a linked-list). For example, we can have all ERROR/WARNING/INFO messages output immediately, as is current practice. The difference would be for DEBUG messages:

- Common practice is that if DEBUG is set then we'll output DEBUG messages as they're encountered; if DEBUG is not set then we'll skip over them.

- The suggestion is that if DEBUG is not set, then instead of skipping messages the logger would put them in a buffer. If an unrecoverable error occurs, we can put a call like `logger.dumpDebug()` in the handler, so all those DEBUG messages will appear in the logs to help us diagnose the error. If no error occurs, the buffer just gets discarded.

This obviously takes more memory, but fits well with a request/response setup like a server. Long-running applications/games could either use a ring buffer, or identify points where it makes sense to empty the buffer.


+ Use correlation ids on more complex applications so you can track more than a single failed call into an application.

Getting an error like "this thing fell over because of a number parsing error" might not be very useful when you can't recreate it.

With correlation, you then realise that the user did something specific beforehand and then the cause is more obvious.


Also, it goes without saying, but never buffer logs. Each line should end up with a write(2) call or something equivalent. If you buffer logs, and the process crashes, you lose logs exactly when you really really need it.


I agree with you in theory if resources were infinite but there is no way to log at the rates developers actually want to log and have a performant application without buffering.

I think the more practical advice is "buffer logs outside your application" or "buffer logs in a separate thread/process that crashes independently of your application code."

Even the small places I've worked could easily push 250G logs a day.


I guess "buffering in a separate process" can be handy, esp. if you have strict performance requirements, but I guess using shared memory correctly can be tricky...?

(Without shared memory, I imagine you'll have one system call per log anyway - in which case it's not much different from calling one write() per line.)


I like this post, and I think the recommendations are sound. I'm not quite sold on the technical vs business logic distinction between debug and info. I use debug for all those messages I only want to see if something has gone wrong, and info for all those I want to see to know its working. That might fall along the business vs technical distinction, but it might not.


Short, clear, and pragmatic. This is the kind of post I'd love to see every day.


I completely disagree with Thomas' first point. It is better IMO to always log before you do something so that if you crash during whatever was happening, you know what you were trying to do. Additionally, if you have enough logging, you can figure out that some call did succeed by implication so long as you have access to the source code/know what the order of high level operations are.

If for some reason you can't easily figure out whether something succeeded, you should probably log before and after the operation. I personally try to avoid doing this since it can clutter up the log, but it can be useful depending on how code is structured.

Another minor point is that I prefer to have INFO cram in as much information as possible, with DEBUG reserved for stuff that is ignored unless you explicitly turn it on, because its utility doesn't justify its cost.


Great post! It’s also the first time that I see a concise explanation of the difference between info and debug levels.


> Log after, not before

This is already weird, what about operations that take a long time? If the session is interactive, the user should definitely know that nothing hung up. Or is it out of scope somehow?


For that there is usually some log wrapper that times the operation, e.g.

   using(_log.TimeOperation("Doing the thing"))
   {
      DoTheThing();
   }
So that you get a start message, and an end message with timing, and if you exited with normal control flow vs. with an exception.

Author doesn't seem to have reached that level of best practice yet.


Yeah, it's best to log BOTH before and after.


Or if you think both is too much, probably only before? Might depend on the application, but I've seen enough native application logfiles with "did FOO" as the very last line, meaning the crash happened on the action executed next, which wasn't logged because 'log after!'...


Logging before has the disadvantage of being noisier than after only, at least if you want to inspect the result of the operation.


The author later explains that in such situations, you log before at DEBUG level, and after at INFO.


I tend to use VERBOSE severity if I'm just logging "we've reached this line in the code and local variables are set to $whatever".

Since you generally only want to see them when diagnosing a frozen job or something equally obtuse.


Don't SPAM the logs.

You don't need to log every method call, every variable, every "if" statement, etc. That's what the debugger is for.

And, I shouldn't need to say it: Don't log passwords or other sensitive information.

Also: Make sure your logger is configured correctly. Log4Net used to (and it still might) calculate an entire stack trace with each call to the logger. It can be the single highest consumption of CPU in an application; even if you aren't actually logging stack traces.


But if I don’t log everything, why did I build a massive ELK stack for my big data logs


> You don't need to log every method call, every variable, every "if" statement, ect. That's what the debugger is for.

Lambdas don't have debuggers afaik. So, verbose logging is the simplest (only?) way out. Probably not at the "let's log every action" level but I find myself erring on the side of logging more rather than less.


If you're using a tool like Splunk, verbosity of logging is usually not much of a problem.


Only at trivial request volume. It’s easy to overwhelm any logging system without discipline.


Nah, Splunk can handle anything you throw at it.


I am currently at an organisation where Splunk is a bottleneck.


No, you are currently at an organization that hasn't allocated enough resources to Splunk.


Hah! OK, sure.


Users sure do appreciate people who put love into their logging.

Remember that just because you love to hear what your application is doing, it's not always helpful or useful to your user. Give consideration to the Rule of Silence (http://www.linfo.org/rule_of_silence.html) and even if you do not agree with it, remember that some of your users might.


I read a piece somewhere that really stuck with me. I can't find it again, but the idea was that for complex systems, logging discrete events makes little sense. It's not sensible to operate complex systems by discrete events, in part because there are so many things happening, but also because it might be hard to figure out what nth-order effects events have.

Instead, to figure out how well a complex system is running, you need to monitor continuous values. Things like flow volume across boundaries, level or volume in buffers or storage, composition of things as fractions of the whole, and so on.

Other industries (nuclear, aeronautics, oil and gas, electricity) know this already, but us software engineers have been slow to pick it up.

That resonates entirely with me, but I still have trouble putting it into practice. Does anyone have references to stuff written about this?


> Things like flow volume across boundaries, level or volume in buffers or storage, composition of things as fractions of the whole, and so on.

Sounds expensive. I'd say logging in IT hasn't been as advanced because the tradeoffs most software projects chose are very different from requirements of nuclear, aeronautics, etc.


I think you conflate application insight with error tracing.


A friend of mine wrote a similar article back in 2006: https://nerds.kueski.com/better-logging/

We used that as a standard for logging at some point; it has some pretty good insights.


For python logging specifically, here's an excellent article: https://blog.guilatrova.dev/how-to-log-in-python-like-a-pro/


I use WARN for "might be a problem". Only ERROR definitely requires action.

An elevated number of WARNs indicates a problem. So this is a metric we monitor. And obviously ERRORs since that means there is a problem.


I always try to think of the log like I think of the UI, i.e. make it usable with as little knowledge as possible. If the user/1st line support can use it without calling you that's the goal.


I have noticed that instead of throwing errors, I enjoy using a generic Event type to determine whether an action was ok or not, like in Elixir (but in TypeScript). It makes for easy logging too, as you have to write an error message for each potential error the app might throw, versus letting things crash and then figuring out what happened.

I guess you can write explicit thrown error messages too. It's just that throwing and catching seems more annoying and indirect compared to the events.


MDCs (Mapped Diagnostic Context) are always helpful with logging. In my opinion this is better than using a list of params as in the article.

The MDC can be made to work with multithreading too.

http://logback.qos.ch/manual/mdc.html
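Roughly, assuming SLF4J/logback, where the %X{...} conversion word pulls values from the MDC:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    public class NewsletterService {
        private static final Logger log = LoggerFactory.getLogger(NewsletterService.class);

        void register(String requestId, long userId) {
            MDC.put("requestId", requestId);       // request-scoped context, set once
            MDC.put("userId", Long.toString(userId));
            try {
                log.info("User registered for newsletter.");
            } finally {
                MDC.clear();                       // MDC is thread-local; clean up
            }
        }
    }

    // logback.xml pattern that prints the MDC values on every line:
    // <pattern>%d{HH:mm:ss.SSS} %-5level [%X{requestId}] [%X{userId}] %msg%n</pattern>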


Don't forget exit logging[1], I am a bit ashamed how far into my career I was before understanding this concept.

[1]: https://pinecoder.dev/blog/2020-04-27/node-exit-logging


Another thing to think about is how the log looks on info, vs on warn, and debug.

In production, if you set to warn only, will you have enough data? If you set to info, will you have too much?


Two more points:

- Always log errors

- Don't log errors twice (log or rethrow)
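An illustrative sketch of the "log or rethrow, never both" rule:

    import java.io.IOException;
    import java.io.UncheckedIOException;

    class Example {
        void runAndLog() {
            try {
                doWork();
            } catch (IOException e) {
                // Option 1: this layer owns the error, so it logs it (once) and recovers.
                System.err.println("Work failed, giving up: " + e);
            }
        }

        void runAndRethrow() {
            try {
                doWork();
            } catch (IOException e) {
                // Option 2: wrap and rethrow WITHOUT logging; the caller will log it.
                throw new UncheckedIOException("work failed", e);
            }
        }

        void doWork() throws IOException { /* ... */ }
    }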


} catch(e: UnauthorizedException) { log.warn("Request to REST API was rejected

Don't throw exceptions for auth failures.


I love these suggestions, but what is most important is that you make these choices (or others) and stick to them.

Consistency in how you log is probably Best Practice #0, IMO.



