One thing we've done that's greatly improved the effectiveness of our API logs is creating a unique context identifier string for each API call and passing it back in the response headers. This allows you to copy the string from the browser and grep the logs to immediately find the call in question and see whether an error occurred.
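As a rough sketch of that workflow (the header name, ID format, and log layout here are all assumptions, not the poster's actual setup):

```shell
# The per-request context ID would normally be copied from the
# response headers in the browser's network tab.
REQUEST_ID="req-7f3a9c"

# Simulated API log for illustration
cat > /tmp/api.log <<'EOF'
2015-02-02T01:00:00Z req-7f3a9c GET /cart 200
2015-02-02T01:00:01Z req-1b2c3d GET /cart 200
2015-02-02T01:00:02Z req-7f3a9c ERROR NullPointerException
EOF

# One grep finds every line for that call, including the error
grep "$REQUEST_ID" /tmp/api.log
```

The payoff is that a single opaque string ties together every log line a request produced, across handlers and even across services if the ID is forwarded.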
When I cared about logs on individual servers, I wrote a program to parse them to make it easy to find what I want, and I wrote it as a command line utility so that it could be used in combo with cat, grep, awk, sed, etc. It's on my github: https://github.com/jedberg/quickparse
Splunk/ELK. I have used Splunk since its inception (I was one of the first paying customers) and enjoy its many features and integrations (e.g. AWS CloudTrail, Nagios, and anomaly detection). In the past few years, I've come to know and like the ELK stack: Elasticsearch, Logstash, and Kibana. The open-source approach has some nice properties as well. It's on you to get the logs ingested, but generally that is remote syslog (rsyslog, syslog-ng, or equivalent) and Logstash with Redis. The docs on this integration are not super strong, but with a little hacking you can get it working. The feature set is smaller in this stack and the UI is not nearly as nice. The savings are huge though, especially as log volume goes up. Splunk also has a free service targeted at developers called Splunk Storm. It's good for proof of concept and easy to set up without any hardware requirements, as it runs on AWS.
Some people find EFK (Elasticsearch, Fluentd, Kibana) to be another compelling alternative to Splunk. (Disclaimer: I am one of the maintainers of Fluentd.)
1. Easier to extend than syslog-ng if you have a modest knowledge of Ruby.
2. Easy to configure file- and memory-based buffering and failover.
3. Advanced filtering out of the box.
4. Rich plugin ecosystem with 300+ plugins.
At least that's what I've heard from the users who switched from syslog-ng to Fluentd. I am happy to learn more about what makes syslog-ng great since I've never used it seriously myself =)
I have used both Splunk and ELK: Splunk at a bank and now ELK at a startup. My experience is that Splunk is not worth the money. You can pretty much do everything you wish with ELK, or by further processing the data and loading it back into Elasticsearch, which is what we do at my current place.
Once Splunk was about to break the bank, we abandoned it and started looking for something in the open-source world.
We've toyed with and pretty much failed at using Graylog2. Although it has been coming along steadily in features and stability, we just found that the interface, although pretty, was not intuitive to us: lots of links and multi-click scenarios to get to what you want, and creating filters and streams was difficult and prone to failure.
After watching a couple of very compelling presentations by Jordan Sissel (Logstash founder), we decided to test it out. Once I realized that creating a filter (Grok rocks!) that searched for a term and reorganized the log to my liking only took a couple hours, I was sold.
Another selling point for us was that Logstash has over two dozen ways to suck logs in, including the usual suspects: syslog, files, TCP, UDP, and *MQ. You can also perform a bunch of log parsing on the client (i.e. the servers with the logs) before sending them to your central ELK server/cluster.
At the end of the day, there is nothing magical about any of these systems. You alone know your logs best and have to figure out how to read/parse/search them. Our switch to Logstash from Graylog2 was our failing, not Graylog2's.
We use the heck out of GrayLog2. We have a dual datacenter setup with multiple elastic search instances. A coworker of mine did a fantastic job writing up our setup:
I've only maintained a small number of servers, but I've found a good solution in what I'll call the GEL (Graylog2 - Elasticsearch - Logstash) stack. It's been some time since I last used Graylog2, but I recall that it was somewhat lacking in the pretty-charts-and-graphs department (though that may have improved recently) and that the search functioned beautifully.
I'm surprised Graylog2 isn't mentioned more here. We've got tens of terabytes of logs in Graylog2, and I couldn't imagine not using its streams, alerts, and search functionality. It's become a core part of our alerting and monitoring infrastructure.
That's for personal projects, the CloudFlare setup is a little more complex and perhaps one of the data team would be best answering that... if you're interested then I can ping them to see if there's a volunteer for a blog post describing how we do logs at scale.
An ELK stack is definitely a lot easier and more straightforward to set up. Since we need to support a dozen or so teams with varying performance considerations, we found that Logstash left a lot to be desired. In the end Fluentd performed well and gave us a lot of flexibility.
I'd like your view from the trenches regarding Logstash vs Fluentd, specifically why it's better for 12+ teams. I'm having to make a similar decision myself and would enjoy your insight! Thanks.
Ozge, co-founder of topLog (toplog.io) here. Perhaps my opinion is slightly biased, but here is what we have learnt over the last couple of years.
- ELK is great (in fact, we use E+L under the hood) however you really need to know what you're doing with it and you need to spend some time configuring things while putting it together. Do you have that kind of time? Maybe, maybe not...
- Every available tool lets you search and create alerts for monitoring. So, the analysis is always on you. This still takes a lot of manual search time during troubleshooting.
- What if pattern and behaviour detection on your logs could be done automatically? Well, it can be. And that saves you a good amount of time compared with creating regexes and following the trails to find the root cause yourself.
I would love to hear your thoughts on automated analysis and anomaly detection on logs. If you're giving it a keyword and a specific time frame to search for anomalies, is that real anomaly detection, or just an improved search?
You should give Echofish a try. It's worked wonders on our network with its "whitelisting of normal behaviour". You won't believe the things you'll discover with this approach.
EDIT: The most fascinating aspect for me is that echofish is more geared towards the actual log entries, rather than statistical analysis, in order to automatically detect anomalies in your logs activity.
Well, its approach (quoting its project page) is pretty simple:
Echofish is a purpose-built solution for filtering & monitoring of syslog activity. By whitelisting regular messages through the web UI, the administrator can instruct the log processing mechanism to create alerts only for anomalies (irregular messages).
...and actually, it can do lots more once you read the built-in help (such as distribution (using BGP) of IP blacklists, consisting of IP addresses collected through syslog activity).
TL;DR: It's geared towards filtering noise from logs. This also means you can have another daemon reporting network activity through syslog, while Echofish acts as your noise filter.
To the people suggesting ELK, I just want to ask if you have actually used it in production? Like for real bug hunting and investigating support requests?
As much as we absolutely love Elasticsearch for our other indexing needs, we find it quite hard to get the LK part of the stack to deliver as promised. Kibana may serve up nice graphs and charts, but when you need to drill down into a large amount of log data, we often feel like we're losing both overview _and_ detail.
It might very well be that we are to blame, and that we are just doing it wrong (tm) - but I would love to hear how other people are leveraging the ELK stack in production environments?
We use ElasticSearch and Kibana in production for real bughunting and support requests. Logstash was too frustrating to deal with so we wrote our own simple wrapper around an open source ElasticSearch client library to log ourselves.
We log every request (everything but the body, usually) and response. If an error occurs, it's logged as part of the request. We can practically replay actions taken by users and easily drill down to the exact requests pertaining to an error.
since is a Unix utility similar to tail. Unlike tail, since only shows the lines appended since the last time it was run. It is useful for monitoring growing log files.
It's in the usual yum/apt repos as well as homebrew for Mac.
It's a bit hard to websearch for because 'since' is a common word.
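For a machine where since isn't packaged, its core trick can be roughly emulated with a saved byte offset. This is a sketch of the idea, not a replacement for the real tool (the function name and state-file location are made up):

```shell
# Print only the lines appended to a file since the last invocation,
# by remembering the file's byte count in a small state file.
since_lite() {
  file="$1"
  state="/tmp/$(basename "$file").offset"
  offset=0
  [ -f "$state" ] && offset=$(cat "$state")
  tail -c "+$((offset + 1))" "$file"   # print from the byte after the old end
  wc -c < "$file" > "$state"           # remember the new end of file
}

# Demo on a growing log file
printf 'line1\nline2\n' > /tmp/demo.log
since_lite /tmp/demo.log > /dev/null   # first run: prints everything
printf 'line3\n' >> /tmp/demo.log
since_lite /tmp/demo.log               # prints only "line3"
```

The real since is more careful (it handles truncation and rotation), which is why the packaged tool is worth installing.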
I regularly search logs across a huge fleet of hosts (thousands). I have a script that will send an arbitrary command in parallel to each host and return the output in a JSON file.
Once I get the logs I'm interested in, it's usually a straightforward combination of jq, grep, sed, awk, cut, sort, uniq, xargs, etc. If I need to do some fancy queries on the data, I have another script that will parse the logs and load them into a SQLite db.
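A typical pass over the fetched logs might look like the following. The log layout and field positions are assumptions for illustration; the point is the sort | uniq -c idiom:

```shell
# Simulated fetched fleet log: timestamp, host, status, path
cat > /tmp/fleet.log <<'EOF'
2015-02-02T01:00:00Z host-a 200 /api/cart
2015-02-02T01:00:01Z host-b 500 /api/cart
2015-02-02T01:00:02Z host-a 200 /api/user
2015-02-02T01:00:03Z host-c 500 /api/cart
EOF

# Count requests by status code, most frequent first
awk '{print $3}' /tmp/fleet.log | sort | uniq -c | sort -rn
```

When a question outgrows this kind of one-liner (joins, group-by over several columns), loading the parsed fields into SQLite as the poster describes is the natural next step.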
Just forking off processes that SSH into the hosts and write to temp files, which are picked up by the main process. It's gotten pretty elaborate, but the basics are pretty simple.
As others have said, the classics awk, grep, cut, sort, uniq, and scripting languages (perl/ruby for on-the-fly one-liners) are great for log analysis. One additional tool I've found particularly useful (three, actually) is zgrep/zcat/zless. You'll often be searching through archived gzipped logs, so it's nice being able to work with the files without decompressing them first.
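A quick demonstration of the z-tools on a freshly compressed log (file names here are invented for the example):

```shell
# Make a small "archived" log
printf 'INFO boot ok\nERROR disk full\n' > /tmp/old.log
gzip -f /tmp/old.log                # produces /tmp/old.log.gz

zgrep ERROR /tmp/old.log.gz         # search the compressed file in place
zcat /tmp/old.log.gz | wc -l        # or stream it into any normal pipeline
```

zless works the same way for interactive paging, so rotated .gz logs need never be unpacked onto disk just to be read.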
Lots of companies use SumoLogic https://www.sumologic.com/signup/.
The UI is very similar to Splunk. But having used both Splunk and SumoLogic, I personally think that Sumo is superior in many ways.
1. It requires very little hacking and setup to work (cough ELK cough Splunk) since Sumo is completely cloud-based. It literally takes 2 minutes to sign up, 5 minutes to download and configure some log collectors, then voila you're ready to send data and search. Also I believe you can tell Sumo Logic to grab logs directly from S3 for example if you're running everything on AWS.
2. SumoLogic is pretty easy to use (it's basically cloud-based grep/awk) and has some really cool features that make Splunk feel clunky in comparison. Parsing and transposing data for graphing is really simple. Also, little things like auto-suggesting sources/hosts while you're typing a query make the experience much smoother than jumping around tabs copy/pasting shit.
3. If you start to generate a lot of logs, and I mean a metric fuckton of logs from 1000s of servers, Splunk Storm will most definitely not be able to help you. In-house Splunk / ELK clusters will need to be carefully sized (just google ELK sizing).
As software developers, we have enough on our plates that it really pays to use tools that help rather than make you want to throw up your hands and curse. KISS.
This comment seemed incredibly positive for a neutral comment, but didn't disclose a relationship to the company (here or in the HN profile, at least as of this writing). The comment seemed odd as the first-ever comment from a 5+ year old HN account.
Coincidentally, I had a demo call with Sumo today. I was very impressed, except for one point that makes Splunk the clear winner at the moment: I can send structured logs to Splunk and it automatically finds the keys and allows me to query on them immediately.
Meaning, I can send:
2015-02-02 01:00:00 event="Product sold" price=5
And with zero configuration in Splunk, I can now query: event="Product *" price>2 | stats sum(price)
And in the next iteration of my app, I could add 30 more key/value pairs to that message and could query on the rest of them just the same, no configuration. It makes development incredibly rapid to be able to instantly report on any metric anyone on my team logs out, debug-related or otherwise, without having to maintain some master list of every key in every log message in every service we write.
I was floored in my Sumo call today when I was told that wasn't the case in that product yet. It seems like such a basic feature, and is why many products have switched entirely to JSON-based logs. Have you discovered a workaround, or found it to be as cumbersome as I'm anticipating?
Sumo Logic currently has the ability to extract known fields on ingest, making them available for searches, much like the Splunk query provided above. Dynamic fields, such as new KVPs that are logged out are able to be pulled out in the query with one extra step, as follows:
| kv infer "event","price" | sum(price) by event | where price >2
The kv operator refers to key value pairs. There is also a json operator which functions the same way.
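Outside either product, the key=value convention in the example above is also easy to exploit with plain awk. This is a rough local stand-in for the `price>2 | stats sum(price)` query, using the sample log lines from the thread (the filtering logic is my own sketch, not either vendor's implementation):

```shell
# Sample structured log, borrowed from the example above
cat > /tmp/events.log <<'EOF'
2015-02-02 01:00:00 event="Product sold" price=5
2015-02-02 01:01:00 event="Product sold" price=1
2015-02-02 01:02:00 event="Product sold" price=4
EOF

# Find price=N fields, keep those > 2, and sum them
awk '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /^price=/) {
      split($i, a, "=")
      if (a[2] + 0 > 2) sum += a[2] + 0
    }
} END { print sum }' /tmp/events.log
```

This prints 9 (5 + 4), mirroring the stats sum(price) result; the appeal of Splunk's version is that you get this with zero parsing code, on any key anyone ever logs.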
Personally I use Emacs with occur-mode for filtering text; when it finds the pattern, it also lets you jump between instances quickly.
Sometimes I also use regexes with occur to find multiple items.
Notepad++ also has an Analyze plugin, which I recommend for complex stuff if you don't like Emacs.
We use Papertrail. I suppose expense is relative, but it's absolutely worth it to us. They've really nailed log search, which is what really matters to us.
Kafka for log aggregation; sed, awk, grep, tail, and other bash utilities for analysis. Extremely fast at over 5k lines per second, and totally maintenance-free running on a 2 GB AWS instance.
ELK is easy to setup but when you really need to analyse logs it can be painful, I think the real problem is Kibana which is just not good enough for that task.
Graylog2. They have worked around the shortcomings of Elasticsearch for log management (in the end it is a Lucene-based full-text search engine for general-purpose tasks), and the 1.0 that is about to go GA soon has crazy stability.
We tried Logentries, but their agent was a terrible Java application that was hard to get working right.
The list of files was saved on their service (rather than in a text file on the server), and the names of our servers were also guessed by their service, which made it hard for us to add and maintain servers.
I think they should build a better agent that embraces UNIX more and can be configured through a local configuration file. Their platform seems nice, but we weren't able to use it, sadly.
There are two agents. The one for Linux/BSD/OSX is Python-based open source [1] (I'm one of the authors), with repositories for Debian/Ubuntu/CentOS/etc. It takes the name of the server from the hostname, or it can be specified via the command line during initialization. Logentries supports syslog for agent-less setups, if you don't mind that many syslogs have broken SSL/TLS implementations.
If you have suggestions for how to improve it, let me know; I'm happy to look at it. Client-side configuration and metrics will be available in a week or so.
So the logs being followed are actually configured in a text file on the servers. This makes it super simple for deploying via chef/puppet in large scale environments.
I'm sorry your experience left something to be desired. Given that we deploy across far too many hosts to actually monitor, or, frankly, care about monitoring, I use it only at the application level (logentries.log([string, object, etc.])).
My work recently switched to Sumo and I love it. If you've used Splunk, Logstash, or Elasticsearch + Kibana previously, you'll think Sumo is the best. It has the right amount of power and simplicity, plus great documentation.
How many logs are we talking about, and produced at what rate? These are key questions. If you have tons of logs and you might have to make a full access pass over them, the last thing you want to do is to centralize them on some kind of logs host with lots of disks and few CPUs. But if you have little or moderate amounts of logs you may be able to get away with a single host and some xargs -P grep type of thing.
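The "xargs -P grep" idea mentioned above parallelizes a search across cores on a single host. A minimal local sketch (file names and contents invented for the demo):

```shell
# Build a few sample log files
mkdir -p /tmp/logs
printf 'ok\nERROR timeout\n' > /tmp/logs/app1.log
printf 'ok\nok\n'            > /tmp/logs/app2.log
printf 'ERROR refused\n'     > /tmp/logs/app3.log

# -P 4 runs up to four greps concurrently, one file each (-n 1);
# -H prefixes each match with its filename; sort makes output deterministic
ls /tmp/logs/*.log | xargs -P 4 -n 1 grep -H ERROR | sort
```

The same shape scales out to a fleet by swapping the grep for an ssh invocation per host, which is roughly what the script described earlier in the thread does.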