I have a persistent mental nag that the way we do logging is so inefficient: in theory no application should ever be outputting actual text messages, because logs are basically fixed strings + formatting data.

So in theory we could uniquely identify every log message and emit a much more compact binary representation + a lookup table we ship with the executable.
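
A toy sketch of what I mean (the wire format and string table here are made up, not any real tool's):

    import struct

    # Hypothetical string table, generated at build time and shipped next to
    # the executable instead of being baked into every emitted log line.
    STRINGS = {1: "user {} logged in from {}", 2: "cache miss for key {}"}

    def emit(log_id, *args):
        # Made-up wire format: 2-byte message ID, then length-prefixed arguments.
        out = struct.pack("<H", log_id)
        for a in args:
            blob = str(a).encode()
            out += struct.pack("<H", len(blob)) + blob
        return out

    def decode(buf):
        log_id, = struct.unpack_from("<H", buf, 0)
        pos, args = 2, []
        while pos < len(buf):
            n, = struct.unpack_from("<H", buf, pos)
            args.append(buf[pos + 2:pos + 2 + n].decode())
            pos += 2 + n
        return STRINGS[log_id].format(*args)

    record = emit(1, "alice", "10.0.0.7")  # a handful of bytes, no text on the hot path
    print(decode(record))                  # "user alice logged in from 10.0.0.7"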


Yep, every actually efficient logging system does it that way. It is the only way you can log fast enough to saturate memory bandwidth or output billions of logs per core-second.

You can see a fairly general explanation of the concept here: https://messagetemplates.org/


Microsoft had a poorly documented tool called tracewpp that did this. Blindingly fast logging with very little runtime overhead.

It was hard to figure out how to use it without documentation, so it wasn’t very popular. No idea if they still ship it in the DDK.

It was a preprocessor that converted logging macros into string table references so there was no runtime formatting. You decoded the binary logs with another tool after the fact.

Vaguely remember some open source cross platform tool that did something similar but the name escapes me now.


I think the Windows event log works like this. Sadly it's very opaque and difficult to use for non-admin apps (you need admin rights to install your logs the first time; afterwards you can run with fewer privileges).


If you're thinking of ETW (event tracing for Windows) and not the actual Windows EventLog itself, then you're right. traceWPP used ETW under the hood to record logging as ETW events in a file.


The Windows Event Log also used (uses?) this idea of pre-defined messages and you just supplied an event ID and data to fill in the blanks in the message.

Originally there was only one system-wide application event log and you needed to be admin to install your message definitions but it all changed in Vista (IIRC). I'd lost interest by then so I don't know how it works now. I do know that the event log viewer is orders of magnitude slower than it was before the refit.


It’s not the same system. The message-catalog-based method requires entering your message strings in .mc files at compile time. It is an aid for localization, and when and how messages are displayed is up to the application.

ETW is for high speed general purpose logging of low-level events with multiple collection mechanisms including realtime capture.


> It was a preprocessor that converted logging macros into string table references so there was no runtime formatting. You decoded the binary logs with another tool after the fact.

I'd be curious how that stacks up against something like zstd with a predefined dictionary.
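
A rough sketch of the zstd side of that comparison, using the third-party zstandard Python package (the sample log paths are hypothetical):

    import glob
    import zstandard as zstd

    # Train a shared dictionary on a corpus of past log lines (paths made up),
    # then compress each small batch of new lines against that dictionary.
    samples = [open(path, "rb").read() for path in glob.glob("logs/*.log")]
    dictionary = zstd.train_dictionary(16 * 1024, samples)

    compressor = zstd.ZstdCompressor(dict_data=dictionary)
    decompressor = zstd.ZstdDecompressor(dict_data=dictionary)

    line = b"2024-03-01T12:00:00Z INFO user alice logged in from 10.0.0.7\n"
    packed = compressor.compress(line)
    assert decompressor.decompress(packed) == line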


Apple’s os_log API also works this way.


Compression will basically achieve that kind of encoding without tightly coupling your logging with an external enum or schema.


Compression is only ever as good as the information it is given; given text, it can do an OK job using general entropy encoding. However, most data is not text, and even log data is not text. Log data is timestamps, numbers, enums, indices, guids, text, etc. Each of those categories, combined with how they are laid out, can be compressed in different optimal ways, resulting in different optimal compression ratios depending on your specific log format. As a simple example, compressing 3D meshes as meshes is an order of magnitude better than compressing 3D meshes as text.
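
A toy version of the same point, using timestamps instead of meshes:

    import struct
    import zlib

    # One sample per second for a day, stored as log text vs. delta-encoded binary.
    timestamps = [1_700_000_000 + i for i in range(86_400)]

    as_text = "\n".join(str(t) for t in timestamps).encode()

    deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
    as_binary = struct.pack(f"<{len(deltas)}q", *deltas)

    print(len(zlib.compress(as_text)))    # generic entropy coding over digits
    print(len(zlib.compress(as_binary)))  # knowing "these are timestamps" wins big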

Sure, zipping your logs gives you a LOT, and you should do it, but if the result disappoints it is not the end of the road, at all.


Isn't compression the best general-purpose solution, with theoretical guarantees? I mean, even a simple Huffman coding could exploit the key-value relationship (the repeated keys are prime candidates for compression) and compress it. But if you want to extract more juice, that implies knowing the datatypes of the strings themselves. It would be like having types for logs, and not messing those up.


This all might be true for log generation, etc.

But anyone who has tried to wrangle gazillion-row dumps in a variety of old msgpack, protobuf, etc. formats and make sense of it all will hate it.

Zipped text formats are infinitely easier to own long term and import into future fancy tools and databases for analysis.


Besides space, you can also save CPU when writing logs.

This is a logging library that does lazy formatting: https://defmt.ferrous-systems.com/


Your feelings are spot on.

In most modern distributed tracing, "observability", or similar systems the write amplification is typically 100:1 because of these overheads.

For example, in Azure, every log entry includes a bunch of highly repetitive fields in full, such as the resource ID, "Azure" as the source system, the log entry type, the tenant, etc...

A single "line" is typically over a kilobyte, but often the interesting part is maybe 4 to 20 bytes of actual payload data. Sending this involves HTTP overheads as well such as the headers, authentication, etc...

Most vendors in this space charge by the gigabyte, so as you can imagine they have zero incentive to improve on this.

Even for efficient binary logs such as the Windows performance counters, I noticed that second-to-second they're very highly redundant.

I once experimented with a metric monitor that could collect 10,000-15,000 metrics per server per second and use only about 100MB of storage per host... per year.

The trick was to simply binary-diff the collected metrics with some light "alignment" so that groups of related metrics would be at the same offsets. Almost all numbers become zero, and compress very well.
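
A toy version of the diff trick (not the actual collector, just the idea):

    import random
    import struct
    import zlib

    random.seed(0)

    def next_sample(prev):
        # Fake "performance counters": most barely move from second to second.
        return [v + random.choice((0, 0, 0, 1)) for v in prev]

    counters = [random.randrange(1_000_000) for _ in range(10_000)]
    raw = b""
    diffed = b""
    for _ in range(60):  # one minute of once-a-second samples
        nxt = next_sample(counters)
        raw += struct.pack(f"<{len(nxt)}q", *nxt)
        deltas = [b - a for a, b in zip(counters, nxt)]
        diffed += struct.pack(f"<{len(deltas)}q", *deltas)
        counters = nxt

    # The diffed stream is almost all zero bytes and compresses dramatically better.
    print(len(zlib.compress(raw)), len(zlib.compress(diffed)))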


You never send a single individual log event per HTTP request; you always batch them up. Assuming some reasonable batch size per request (minimum ~1MiB or so), there is rarely any meaningful difference in payload size between gzipped/zstd/whatever JSON bytes and any particular binary encoding format you might prefer.


Most log collection systems do not compress logs as they send them, because again, why would they? This would instantly turn their firehose of revenue cash down to a trickle. Any engineer suggesting such a feature would be disciplined at best, fired at worst. Even if their boss is naive to the business realities and approves the idea, it turns out that it's weirdly difficult in HTTP to send compressed requests. See: https://medium.com/@abhinav.ittekot/why-http-request-compres...

HTTP/2 would also improve efficiency because of its built-in header compression feature, but again, I've not seen this used much.

The ideal would be to have some sort of "session" cookie associated with a bag of constants, slowly changing values, and the schema for the source tables. Send this once a day or so, and then send only the cookie followed by columnar data compressed with RLE and then zstd. Ideally in a format where the server doesn't have to apply any processing to store the data apart from some light verification and appending onto existing blobs. I.e.: make the whole thing compatible with Parquet, Avro, or something other than just sending uncompressed JSON like a savage.
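
Even a crude columnar re-shuffle before compression shows the effect (toy data, not a real agent or any particular format):

    import json
    import random
    import zlib

    random.seed(0)
    events = [
        {"tenant": "contoso", "source": "Azure", "type": "AppTrace",
         "latency_ms": random.randrange(5, 200)}
        for _ in range(10_000)
    ]

    # Row-oriented JSON lines, roughly what agents ship today.
    rows = "\n".join(json.dumps(e) for e in events).encode()

    # Columnar layout: the constant fields collapse to long runs and the
    # compressor is left with only the actual payload column.
    columns = {key: [e[key] for e in events] for key in events[0]}
    cols = json.dumps(columns).encode()

    print(len(zlib.compress(rows)), len(zlib.compress(cols)))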


Most systems _do_ compress request payloads on the wire, because the cost-per-byte in transit over those wires is almost always frictional and externalized.

Weird perspective, yours.


They will compress over the wire, but then decompress on ingestion and bill for the uncompressed data. After that, an interesting thing happens: they compress the data again, along with other interesting techniques, to minimize its size on their premises. Can't blame them... they're just trying to cut costs, but the fact that they charge so much for something that is so easily compressible is just... not fair.


Part of the problem is that the ingestion is not vector compressed, so they're charging you for the CPU overhead of this data rearrangement.

It would cut costs a lot if the source agents did this (pre)processing locally before sending it down the wire.


We should distinguish between compression in transit and compression at rest. Compressing a larger corpus should yield better results than compressing smaller chunks because dictionaries can be reused (zstd, for example).


This is why metrics rule and logging in production only needs to be turned on to debug specific problems, and even then with a short TTL.


You got... entirely the wrong message.

The answer to "this thing is horrendously inefficient because of misaligned incentives" isn't to be frugal with the thing, but to make it efficient, ideally by aligning incentives.

Open source monitoring software will eventually blow the proprietary products out of the water because when you're running something yourself, the cost per gigabyte is now just your own cost and not a profit centre line item for someone else.


Unless you start attaching tags to metrics and allow engineers to explode the cardinality of the metrics. Then your pockets need to be deep.


I've built a logging system like that in an embedded context, and defmt (https://github.com/knurling-rs/defmt) is an open-source implementation of the same concept. What's most handy about it is that logging continuous sensor data and logging events can both use the same framework.


A side benefit is that you don't need to parse arbitrary strings to extract information from the logs.


It depends on your use case. If you're looking to have a set of data that you can search through, sure. If you're looking to tail the log while working on things, having it output in plain text is very handy.


This is what metrics are for


How big of a lookup table will you use for logging IPv6 addresses?


> fixed strings + formatting data

IP addresses are obviously part of the formatting data, not the fixed strings.


Why wouldn't IPv6 compress well?

IPv6 is used internally a lot more than externally, so I would expect to see a LOT of commonality in the network ID portion of the addresses: essentially all the bits representing your IPv6 network ID get reduced to the number of bits in a compression token, in the worst case. In the moderate case, you get a few chatty machines (DNS servers and the like) where the whole address is converted to a single compression token. In the best case, you get that AND a lot of repetition in the rest of the message, and you reduce most of each message to a single compression token.
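
A quick toy check of that claim (documentation prefix, random host bits):

    import random
    import zlib

    random.seed(0)
    # 10,000 hosts on one /64: only the low bits differ, the prefix repeats.
    addrs = "\n".join(
        "2001:db8:0:1::%x" % random.randrange(2 ** 16) for _ in range(10_000)
    ).encode()

    # The shared network prefix collapses to short back-references.
    print(len(addrs), len(zlib.compress(addrs)))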

It's hard to explain if you haven't actually experimented with it, but modern variants of LZ compression are miraculous. It's like compilers: your intuition tells you hand-tuned assembly is better, but compilers know crazy tricks and that intuition is almost always wrong. Same with compressors: they don't look at data the same way you do, and they work way better than your intuition says they would.
