The company that I recently joined uses honeycomb.io, versus ELK at my previous job. Maybe I haven't played with it enough, or maybe we have it badly configured, but I find it a huge step backwards. As far as I know, there's no full text search and the types of reports and aggregation you can do are extremely primitive. The UI is also barely usable: huge horizontal scrolling, no way to click on a specific value to apply that sub-filter. The list goes on and on. I wish the devops team would focus on these fundamentals (proper centralized logging) before playing with all the "cool" and trendy toys.
As for logging, our `log_format` directive looks like:
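Roughly something like this (the layout and format name are illustrative, and I'm assuming the x-route header described further down is picked up via $upstream_http_x_route):

    # one line per request, with the fields described below
    log_format main '$time_iso8601 $host "$request_method $uri" $request_length '
                    '$status $bytes_sent $request_time $upstream_response_time '
                    '$upstream_cache_status $upstream_http_x_route $client_id';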
This captures most of the generic data you'd associate with a request (time, host, method, uri, length) as well as the generic response data you'd care about (status, length, time to reply). Note that `$request_time` measures the full time nginx spends on the request, from reading the first bytes from the client until the last bytes of the response are sent, whereas `$upstream_response_time` covers only the time spent on the upstream's response.
If you're using nginx as a cache, `$upstream_cache_status` tells you the cache hit/miss status. A list of all variables is available at http://nginx.org/en/docs/varindex.html
All of our services can set an "x-route" header, which helps canonicalize URIs into something more meaningful. It's up to the service to decide what to do, but /v1/users/ID could be called "users:show". A more complex route might use a different name based on arguments. For example, depending on the parameters we expect some hits to /v1/reports to be very fast and some to be very slow, so we'll set the name to "reports:list:fast" or "reports:list:slow" so that we can get more accurate statistics.
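As a simplified sketch of the service side (written in Lua here just to stay in one language; the parameter check is made up), a handler might pick the name based on what the request asks for:

    -- name the route based on the arguments; "include_history" is a made-up
    -- stand-in for whatever makes a report expensive
    local args = ngx.req.get_uri_args()
    if args.include_history then
        ngx.header["x-route"] = "reports:list:slow"
    else
        ngx.header["x-route"] = "reports:list:fast"
    end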
Finally, we use OpenResty to do some initial global authentication (along with some other stuff). This is where $client_id comes from. All requests go through something like:
location / {
    set $client_id '';
    set $upstream '';
    access_by_lua_block {
        require('execute')()
    }
    proxy_pass http://$upstream;
}
A piece of that execute code will possibly set $client_id to something which will then get logged.
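The real module does more than this, but a minimal sketch of its shape (the api-key header, the auth lookup, and the upstream field are all stand-ins) looks like:

    -- execute.lua, heavily simplified; 'auth' stands in for the real lookup
    local auth = require('auth')

    return function()
        local key = ngx.req.get_headers()['x-api-key']   -- stand-in header name
        local client = key and auth.lookup(key)
        if not client then
            return ngx.exit(ngx.HTTP_UNAUTHORIZED)
        end
        ngx.var.client_id = client.id        -- ends up in the access log
        ngx.var.upstream  = client.upstream  -- consumed by proxy_pass http://$upstream
    end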
> As far as I know, there's no full text search and the types of reports and aggregation you can do are extremely primitive.
My understanding is that it's helpful to think of honeycomb.io as a replacement for Scuba -- which, as I understand it, is Facebook's internal tool for doing fast math and aggregations on metric data, even if those metrics come from things that look like logs.
There are some "visualize" things you can do in the Kibana UI which really stretch ELK's capabilities (graphing p99 latency per handler for every one-second bucket of a sample minute in a high-throughput service that writes all its hits to log files, for example). Where Kibana and its backends start to stutter, or when certain kinds of graphs are just impossible, you might begin to be well-served by Scuba-style tools.
Not having worked at Facebook, I have only heard about this secondhand. After a certain scale -- when you have many, many requests per second and when handling a request involves a forest of other services doing work in series or parallel -- you start to feel like you can no longer rely on raw logs and start to lean on tools that do things like distributed tracing or aggregations; if your tools work well, eventually you may even feel a disdain for those raw logs and feel like you're _better_ off using the more advanced tools.
I feel a bit left behind sometimes because I live pragmatically in both worlds (I'm excited to get results rapidly from newer visibility stacks, but honestly am also still happy to log into a system and read its logs -- just in case we ever need to do that).
> I find it a huge step backwards. As far as I know, there's no full text search and the types of reports and aggregation you can do are extremely primitive.
Oof. Honeycomb is for fast, realtime analytics: starting with a high-level question in your mind ("why did our throughput drop by 50%?") and rapidly iterating on a hypothesis (examples in [0]). ELK can... be used for that, but it's optimized for something else (as you said, full-text search and generating static reports).
Being able to flip from a funny-looking graph directly into "raw data" mode is intended to be a bonus in Honeycomb, not the primary way you interact with your data.
While we believe that fulltext search has its place, beyond a certain point (most production systems, these days), sifting through log lines is a brute-force method of answering questions about your systems — especially if you're not sure what the proverbial needle you're searching for looks like. [1]
(But mherdeg's answer is great, go back and read theirs while you're here :))