Hacker News
Ask HN: How do you analyze logs?
72 points by hckrt_ on Jan 31, 2015 | 74 comments
Analyzing logs is a huge part of a software problem solver's job. What kinds of tools and techniques do you use to be more effective in your job?


One thing we've done that's greatly improved the effectiveness of our API logs is creating a unique context identifier string for each API call and passing it back in the response headers. This lets you copy the string from the browser and grep the logs to immediately find the call in question and see whether an error occurred.
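The grep step described above can be sketched like this (the log format, ID value, and file paths are made up for illustration; in practice the ID would come back in a response header such as X-Request-Id):

```shell
LOG=$(mktemp)
REQ_ID="req-12345"          # stand-in for the ID copied from the response header

# Two requests land in the log, each tagged with its context identifier
echo "2015-01-31T12:00:00Z id=$REQ_ID GET /api/users 200" >> "$LOG"
echo "2015-01-31T12:00:01Z id=req-67890 GET /api/items 500" >> "$LOG"

# Grep for the exact ID to isolate the one call in question
grep "id=$REQ_ID" "$LOG"
```

Because the ID is unique per call, the grep returns exactly the lines for that request and nothing else.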


When I cared about logs on individual servers, I wrote a program to parse them to make it easy to find what I want, and I wrote it as a command line utility so that it could be used in combo with cat, grep, awk, sed, etc. It's on my github: https://github.com/jedberg/quickparse


Splunk/ELK. I have used Splunk since its inception (I was one of the first paying customers) and enjoy its many features and integrations (e.g. AWS CloudTrail, Nagios, and anomaly detection). In the past few years, I've come to know and like the ELK stack: Elasticsearch, Logstash, and Kibana. The open-source approach has some nice properties as well. It's on you to get the logs ingested, but generally that means remote syslog (rsyslog, syslog-ng, or equivalent) and Logstash with Redis. The docs on this integration are not super strong, but with a little hacking you can get it working. The feature set is smaller in this stack and the UI is not nearly as nice. The savings are huge, though, especially as log volume goes up. Splunk also has a free service targeted at developers called Splunk Storm. Good for proofs of concept and easy to set up without any hardware requirements, as it runs on AWS.


Some people find EFK (Elasticsearch + Fluentd + Kibana) to be another compelling alternative to Splunk. (Disclaimer: I am one of the maintainers of Fluentd.)

http://docs.fluentd.org/articles/free-alternative-to-splunk-...


What makes Fluentd better than syslog-ng?


It is not "better" but different.

1. Easier to extend than syslog-ng if you have a modest knowledge of Ruby.

2. Easy to configure file- and memory-based buffering and failover.

3. Advanced filtering out of the box.

4. Rich plugin ecosystem with 300+ plugins.

At least that's what I've heard from the users who switched from syslog-ng to Fluentd. I am happy to learn more about what makes syslog-ng great since I've never used it seriously myself =)


I have used both Splunk and ELK: Splunk at a bank and now ELK at a startup. My experience is that Splunk is not worth the money. You can do pretty much everything you want with ELK, or by further processing the data and loading it back into Elasticsearch, which is what we do at my current place.


Once Splunk was about to break the bank, we abandoned it and started looking for something in the open-source world.

We toyed with and pretty much failed at using Graylog2. Although it has been coming along steadily in features and stability, we just found that the interface, although pretty, was not intuitive to us: lots of links and multi-click scenarios to get to what you want, and creating filters and streams was difficult and prone to failure.

After watching a couple of very compelling presentations by Jordan Sissel (Logstash founder), we decided to test it out. Once I realized that creating a filter (Grok rocks!) that searched for a term and reorganized the log to my liking only took a couple hours, I was sold.

Another selling point for us was that Logstash has over 2 dozen ways to suck logs in, including the usual suspects - syslog, files, tcp, udp and *mq. You can also perform a bunch of log parsing on the client (i.e. the servers with the logs) before sending them to your central ELK server/cluster.
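As a rough illustration of the input → filter → output pipeline described above, a minimal Logstash config might look something like this (the port, grok pattern, and option names are assumptions and vary by Logstash version):

```
input {
  syslog { port => 5514 }                                  # one of the many input plugins
}
filter {
  grok { match => [ "message", "%{COMBINEDAPACHELOG}" ] }  # reshape each line into fields
}
output {
  elasticsearch { host => "localhost" }                    # ship to the central ELK node
}
```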

At the end of the day, there is nothing magical about any of these systems. You alone know your logs best and have to figure out how to read/parse/search them. Our switch to Logstash from Graylog2 was our failing, not Graylog2's.



I've only maintained a small number of servers, but I've found a good solution in what I'll call the GEL (Graylog2, Elasticsearch, Logstash) stack. It's been some time since I last used Graylog2; as I recall, it was somewhat lacking in the pretty-charts-and-graphs department (though that may have improved recently), but the search functioned beautifully.


I'm surprised Graylog2 isn't mentioned more here. We've got tens of terabytes of logs in Graylog2, and I couldn't imagine not using its streams, alerts, and search functionality. It's become a core part of our alerting and monitoring infrastructure.


logstash + elasticsearch + kibana

That's for personal projects, the CloudFlare setup is a little more complex and perhaps one of the data team would be best answering that... if you're interested then I can ping them to see if there's a volunteer for a blog post describing how we do logs at scale.


An ELK stack is definitely a lot easier and more straightforward to set up. But since we need to support a dozen or so teams with varying performance considerations, we found that Logstash left a lot to be desired. In the end Fluentd performed well and gave us a lot of flexibility.



> we found that logstash left a lot to be desired. In the end fluentd performed well and gave us a lot of flexibility.

Would you mind providing more details about why you felt that Logstash was lacking?


I'd like your view from the trenches regarding Logstash vs Fluentd, specifically why it's better for 12+ teams? I'm having to make a similar decision myself and would enjoy your insight! Thanks


(Disclaimer: I am a Fluentd maintainer)

Here are some comparisons out in the wild (Note: neither of them is a Logstash or Fluentd maintainer afaik)

- http://jasonwilder.com/blog/2013/11/19/fluentd-vs-logstash/

- https://blog.deimos.fr/2014/05/13/logstash-vs-fluentd/

Also, if you have any questions or doubts about Fluentd, please feel free to email me at kiyoto@treasure-data.com


That would be very helpful indeed...thanks in advance!


I'd be very interested to read that kind of post.


A post like that would be awesome


http://lnav.org -- A tool I wrote and use everyday to view/analyze the logs of the software I work on.



Ozge, co-founder of topLog (toplog.io) here. Perhaps my opinion is slightly biased, but here is what we have learned over the last couple of years.

- ELK is great (in fact, we use E+L under the hood), but you really need to know what you're doing with it, and you need to spend some time configuring things while putting it together. Do you have that kind of time? Maybe, maybe not...

- Every available tool lets you search and create alerts for monitoring. So, the analysis is always on you. This still takes a lot of manual search time during troubleshooting.

- What if pattern and behaviour detection on your logs could be done automatically? Well, it can be. And that saves you a good amount of time compared to writing regexes and following the trails to find the root cause yourself.

I would love to hear your thoughts on automated analysis and anomaly detection on logs. If you're giving a keyword and a specific time frame to search for anomalies, is it real anomaly detection, or just an improved search?


You should give Echofish a try. It's worked wonders on our network with its "whitelisting of normal behaviour". You won't believe the things you'll discover with this approach.

EDIT: The most fascinating aspect for me is that echofish is more geared towards the actual log entries, rather than statistical analysis, in order to automatically detect anomalies in your logs activity.


Is Echofish geared towards classifying network activity as normal/abnormal, or towards general logs (syslog, app/dev logs, etc.)?

Sounds cool.


Well, its approach (quoting its project page) is pretty simple:

Echofish is a purpose-built solution for filtering & monitoring of syslog activity. By whitelisting regular messages through the web UI, the administrator can instruct the log processing mechanism to create alerts only for anomalies (irregular messages).

...and actually, it can do lots more once you read the built-in help (such as distribution (using BGP) of IP blacklists, consisting of IP addresses collected through syslog activity).

TL;DR: It's geared towards filtering noise from logs. This also means you can have another daemon reporting network activity through syslog, while Echofish acts as your noise filter.


Did monitoring logs with Bayesian filters ever catch on? They're very good at finding things that are "off".


By "monitoring" do you mean anomaly detection?


That's part of how our engine works (full disclosure: co-founder of toplog.io).


To the people suggesting ELK, I just want to ask: have you actually used it in production? Like for real bughunting and investigating support requests?

As much as we absolutely love Elasticsearch for our other indexing needs, we find it quite hard to get the LK part of the stack to deliver as promised. Kibana may serve up nice graphs and charts, but when you need to drill down into a large amount of log data, we often feel like we're losing both overview _and_ detail.

It might very well be that we are to blame, and that we are just doing it wrong (tm) - but I would love to hear how other people are leveraging the ELK stack in production environments?


We use ElasticSearch and Kibana in production for real bughunting and support requests. Logstash was too frustrating to deal with so we wrote our own simple wrapper around an open source ElasticSearch client library to log ourselves.

We log every request (everything but the body, usually) and response. If an error occurs, it's logged as part of the request. We can practically replay actions taken by users and easily drill down to the exact requests pertaining to an error.



awk, grep, sed, ag, cut, sort, etc. for most of the stuff; Elasticsearch, Logstash, and Kibana for more complex queries.


We use splunk but there's something to be said for the low-tech approach for ad-hoc queries.

A great little tool is 'since' - a stateful tail. http://welz.org.za/projects/since

  since is a unix utility similar to tail.
  Unlike tail, since only shows the lines
  appended since the last time. It is useful
  to monitor growing log files.
It's in the usual yum/apt repos as well as homebrew for Mac.

It's a bit hard to websearch for because 'since' is a common word.
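The behaviour of 'since' can be emulated in a few lines of shell, assuming a state file that stores the byte offset reached on the previous run (this is a sketch of the idea, not how the tool is actually implemented):

```shell
LOG=$(mktemp); STATE=$(mktemp)
echo 0 > "$STATE"

printf 'first\nsecond\n' >> "$LOG"
OFFSET=$(cat "$STATE")
tail -c +"$((OFFSET + 1))" "$LOG"        # first run: shows everything
wc -c < "$LOG" | tr -d ' ' > "$STATE"    # remember how far we got

printf 'third\n' >> "$LOG"
OFFSET=$(cat "$STATE")
NEW=$(tail -c +"$((OFFSET + 1))" "$LOG") # second run: only the appended line
echo "$NEW"                              # → third
```

The real tool also handles log rotation (the file shrinking or being replaced), which this sketch ignores.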


I regularly search logs across a huge fleet of hosts (thousands). I have a script that will send an arbitrary command in parallel to each host and return the output in a JSON file.

Once I get the logs I'm interested in, it's usually a straightforward combination of jq, grep, sed, awk, cut, sort, uniq, xargs, etc. If I need to do some fancy queries on the data, I have another script that will parse the logs and load them into a SQLite db.
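The grep/awk/sort/uniq part of that workflow is the classic histogram one-liner; for example, counting status codes in an access-log-like file (the log format here is an assumption):

```shell
LOG=$(mktemp)
printf 'GET /a 200\nGET /b 200\nGET /c 404\nGET /d 200\n' > "$LOG"

# last field → sort → count duplicates → most frequent first
awk '{print $NF}' "$LOG" | sort | uniq -c | sort -rn
```

When the question outgrows this pattern (joins, grouping on several columns), that's the point where loading the parsed lines into SQLite starts to pay off.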


What are you using to execute commands on thousands of hosts in parallel?


Just forking off processes that SSH into the hosts and write to temp files, which are picked up by the main process. It's gotten pretty elaborate, but the basics are pretty simple.
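The fan-out pattern described there looks roughly like this; here 'sh -c' stands in for 'ssh $h' so the sketch is self-contained (host names and the command are hypothetical, and a real script would add timeouts and retries):

```shell
TMP=$(mktemp -d)

for h in web1 web2 web3; do
  # in real life: ssh -o ConnectTimeout=5 "$h" 'grep ERROR /var/log/app.log'
  sh -c "echo result-from-$h" > "$TMP/out.$h" 2>&1 &
done
wait                       # block until every background SSH has finished

cat "$TMP"/out.*           # the main process gathers the per-host results
```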


I don't know what the GP's answer will be, but I would use pdsh.


As others have said, the classics awk, grep, cut, sort, uniq, and scripting languages (perl/ruby for on-the-fly one-liners) are great for log analysis. One additional tool I've found particularly useful (three, actually) is zgrep/zcat/zless. You'll often be searching through archived gzipped logs, so it's nice being able to work with the files without having to decompress them first.
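For instance, searching a rotated, gzipped log directly (the file name and contents are made up):

```shell
DIR=$(mktemp -d)
printf 'all good\nERROR: disk full\nall good again\n' | gzip > "$DIR/app.log.1.gz"

zgrep ERROR "$DIR/app.log.1.gz"   # works on the .gz file, no gunzip needed
```

zcat and zless follow the same idea for cat and less, so an entire pipeline can run over compressed archives unchanged.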


Lots of companies use SumoLogic https://www.sumologic.com/signup/. The UI is very similar to Splunk. But having used both Splunk and SumoLogic, I personally think that Sumo is superior in many ways.

1. It requires very little hacking and setup to work (cough ELK cough Splunk) since Sumo is completely cloud-based. It literally takes 2 minutes to sign up, 5 minutes to download and configure some log collectors, then voila you're ready to send data and search. Also I believe you can tell Sumo Logic to grab logs directly from S3 for example if you're running everything on AWS.

2. SumoLogic is pretty easy to use (it's basically cloud-based grep/awk) and has some really cool features that make Splunk feel clunky in comparison. Parsing and transposing data for graphing is really simple. Also, little things like auto-suggesting sources/hosts while you're typing a query make the experience much smoother than jumping around tabs copy/pasting shit.

3. If you start to generate a lot of logs, and I mean a metric fuckton of logs from 1000s of servers, Splunk Storm will most definitely not be able to help you. In-house Splunk / ELK clusters will need to be carefully sized (just google ELK sizing).

As software developers, we have enough on our plates that it really pays to use tools that help rather than make you want to throw up your fists and curse. KISS.


Do you work for Sumo Logic? https://www.linkedin.com/pub/rong-hu/8/140/75 is a "Rong Hu"

This comment seemed incredibly positive for a neutral comment, but didn't disclose a relationship to the company (here or in the HN profile, at least as of this writing). The comment seemed odd as the first-ever comment from a 5+ year old HN account.


Coincidentally, I had a demo call with Sumo today. I was very impressed, except for one point that makes Splunk the clear winner at the moment: I can send structured logs to Splunk and it automatically finds the keys and allows me to query on them immediately.

Meaning, I can send: 2015-02-02 01:00:00 event="Product sold" price=5

And with zero configuration in Splunk, I can now query: event="Product *" price>2 | stats sum(price)

And in the next iteration of my app, I could add 30 more key/value pairs to that message and could query on the rest of them just the same, no configuration. It makes development incredibly rapid to be able to instantly report on any metric anyone on my team logs out, debug-related or otherwise, without having to maintain some master list of every key in every log message in every service we write.
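A rough local approximation of that key=value extraction and aggregation, using shell tools rather than Splunk (the log lines mirror the example above):

```shell
LOG=$(mktemp)
printf '2015-02-02 01:00:00 event="Product sold" price=5\n' >> "$LOG"
printf '2015-02-02 01:01:00 event="Product sold" price=3\n' >> "$LOG"

# pull out the price= pairs and sum their values, like "stats sum(price)"
grep -o 'price=[0-9]*' "$LOG" | cut -d= -f2 | awk '{s += $1} END {print s}'   # → 8
```

The point of Splunk's auto-extraction is exactly that you never have to write this pipeline: any key=value pair becomes queryable the moment it appears in a log line.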

I was floored in my Sumo call today when I was told that wasn't the case in that product yet. It seems like such a basic feature, and it's why many products have switched entirely to JSON-based logs. Have you discovered a workaround, or do you find that to be as cumbersome as I'm anticipating?


Sumo Logic currently has the ability to extract known fields on ingest, making them available for searches, much like the Splunk query provided above. Dynamic fields, such as new KVPs that are logged out are able to be pulled out in the query with one extra step, as follows:

| kv infer "event","price" | sum(price) by event | where price >2

The kv operator refers to key value pairs. There is also a json operator which functions the same way.


SumoLogic is Splunk Jr.


Personally I use Emacs with occur-mode for filtering text; when it finds the pattern, it lets you jump between instances quickly. Sometimes I also use regexes with occur to find multiple items.

Notepad++ also has an Analyze plugin, which I recommend for complex stuff or if you don't like Emacs.


I second emacs occur.


Kibana provides an excellent way to track logs.

We currently handle 100+ GB/day of heavy documents (more or less 100 items per document) on our current setup, and it's probably designed to handle way more.

Dashboards are constantly open on 50+ screens, and we also use them to track MySQL, mailing, internal stats...


We used papertrailapp.com, but as the number of servers grew, it became expensive.

We moved to EL + Kibana, but I'm not liking the interface yet, and it doesn't seem to have 'tail -f'-style functionality.


We use Papertrail. I suppose expense is relative, but it's absolutely worth it to us. They've really nailed log search, which is what's really important to us.


Kafka for log aggregation; sed, awk, grep, tail, and other bash utilities for analysis. Extremely fast at over 5k lines per second and totally maintenance-free, running on a 2G AWS instance.


ELK is easy to set up, but when you really need to analyse logs it can be painful. I think the real problem is Kibana, which is just not good enough for that task:

- no log colorization

- a lot of bugs / weird refresh behavior

- no auth


Do you know of a better alternative to Kibana for an EL_ stack?


At Mozilla Opsec, we use (and write) MozDef: https://github.com/jeffbryner/MozDef/


Amazing name


Splunk. Here's my latest Splunk log analytics app:

http://www.mensk.com/#prettyPhoto/0/


Graylog2. They have worked around the shortcomings of Elasticsearch for log management (in the end it is a Lucene-based full-text search engine for general-purpose tasks), and the 1.0 that is about to go GA soon has crazy stability.

https://www.graylog2.org/


LogEntries.com


We tried Logentries, but their agent was a terrible Java application that was hard to get working right.

The list of files was saved on their service (rather than in a text file on the server), and the names of our servers were also guessed by their servers, which made it hard for us to add and maintain servers.

I think they should build a better agent that embraces UNIX more and can be configured through a local configuration file. Their platform seems nice, but we weren't able to use it, sadly.


There are two agents. The one for Linux/BSD/OSX is Python-based and open source [1] (I'm one of the authors), with repositories for Debian/Ubuntu/CentOS/etc. It takes the server's name from the hostname, or it can be specified via the command line during initialization. Logentries also supports syslog for agent-less setups, if you don't mind that many syslogs have broken SSL/TLS implementations.

If you have suggestions for how to improve it, let me know; I'm happy to look at them. Client-side configuration and metrics will be available in a week or so.

[1] https://github.com/logentries/le/


Hey - You might want to check the agent docs here which have a local configuration file: https://github.com/logentries/le#configuration

So the logs being followed are actually configured in a text file on the servers. This makes it super simple for deploying via chef/puppet in large scale environments.


I'm sorry your experience left something to be desired. Given that we deploy across far too many hosts to actually monitor, or, frankly, care about monitoring, I use it only at the application level (logentries.log(string, object, etc.)).


Grep, awk, and sed for the easy stuff; Logstash/Elasticsearch/Kibana for the harder stuff.


logstash


fluentd -> elasticsearch/kibana

Works pretty damn well.


www.sumologic.com


My work recently switched to Sumo and I love it. If you've used Splunk or Logstash/Elasticsearch + Kibana previously, I think you'll find Sumo is the best. It has the right amount of power and simplicity, plus great documentation.


I have to second this. SumoLogic is very powerful and saves hours of troubleshooting time.


grep


How many logs are we talking about, and produced at what rate? These are key questions. If you have tons of logs and you might have to make a full access pass over them, the last thing you want to do is to centralize them on some kind of logs host with lots of disks and few CPUs. But if you have little or moderate amounts of logs you may be able to get away with a single host and some xargs -P grep type of thing.
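The "xargs -P grep" pattern mentioned there fans a grep out across many files in parallel (file names and contents here are hypothetical; grep exits non-zero for files with no match, hence the `|| true`):

```shell
DIR=$(mktemp -d)
printf 'ok\nERROR boom\n' > "$DIR/a.log"
printf 'ok\nok\n'         > "$DIR/b.log"

# -P 4: up to four greps at once; -n 1: one file per grep; -H: print file name
ls "$DIR"/*.log | xargs -P 4 -n 1 grep -H ERROR || true
```

For a single host this parallelizes nicely across CPUs, which is the argument against funneling everything onto one CPU-starved log box.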


Ship JSON to Elasticsearch, visualize with Kibana, and back up to S3/Glacier.

We use found.on and qbox.io for managed hosting of ElasticSearch clusters.


Logstash/kibana (free, open source)

Splunk (good, but ridiculously expensive)


You can do quite a lot with Splunk Storm[0] for free.

[0] https://www.splunkstorm.com/


And if you prefer a non-cloud option, there's a Splunk Free license [1] which allows for indexing of up to 500MB logs per day.

[1] http://docs.splunk.com/Documentation/Splunk/6.2.1/Admin/More...


splunk



