Development of graphite is effectively dead. The datastore component (carbon and whisper) has design issues, and the official replacement (ceres) hasn't seen any commits this year. There are some alternatives, though.
For data storage, Cyanite [0] speaks the graphite protocol and stores the data in Cassandra. Alternatively, InfluxDB [1] speaks the graphite protocol and stores the data in itself.
To get the data back out, there's graphite-api [2] which can be hooked up to cyanite [3] or influxdb [4]. You can then connect any graphite dashboard you like, such as grafana [5], to it.
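For reference, wiring graphite-api to InfluxDB is just a few lines of YAML. A sketch along the lines of the graphite-influxdb README (the key names and credentials here are illustrative; check the project docs for the exact schema):

    finders:
      - graphite_influxdb.InfluxdbFinder
    influxdb:
      host: localhost
      port: 8086
      user: graphite
      pass: graphite
      db: graphite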
It was originally written and open sourced by Chris Davis of Orbitz. Presumably he either moved on to a new job, got bored with it, or it met his original design goals so he didn't feel the need to completely rewrite the backend just because the internet asked him to.
FWIW I've found influxdb considerably easier to install and manage than graphite (graphite doesn't play well with virtualenv, which makes dependency management horrible compared to influxdb's single static binary).
Also, I can see logging dictionaries being much more efficient and useful than logging single values -- with graphite if you want to track page hits per section of your site (of which you have 10) per user (100) per browser (5), you end up with 5000 individual metrics, and you need to have thought of them in advance. With influxdb you can log {"section": "front page", "user": "bob", "browser": "firefox", "hits": 1} as a single metric and then use an SQL-like query to filter by section / user / browser (or any combination of those) as and when you want to.
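For a flavor of that, here's a sketch against InfluxDB's 0.8-era HTTP API (the database name, credentials and metric names are made up; check the InfluxDB docs for specifics):

    import requests

    WRITE_URL = "http://localhost:8086/db/metrics/series?u=root&p=root"

    # one point carrying all the dimensions at once, instead of
    # pre-creating section.user.browser metric paths
    requests.post(WRITE_URL, json=[{
        "name": "page_hits",
        "columns": ["section", "user", "browser", "hits"],
        "points": [["front page", "bob", "firefox", 1]],
    }])

    # ...and later, slice by any combination of columns on demand
    q = "select sum(hits) from page_hits where browser = 'firefox' group by section"
    resp = requests.get("http://localhost:8086/db/metrics/series",
                        params={"q": q, "u": "root", "p": "root"})
    print(resp.json())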
TBH the only thing I miss going from graphite to influxdb+grafana is the tree of metrics (grafana has autocomplete once you start typing, but you can't just browse) and a few of the rendering functions (moving average).
I think the tagging features in the upcoming 0.9.0 release [1] will help with the navigation of metrics. Along with that, we're adding new types of queries to help with discovery. [2]
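For a flavor of what those discovery queries might look like (InfluxQL as sketched in the 0.9 previews; exact syntax may change before release, and the measurement/tag names are made up):

    SHOW MEASUREMENTS
    SHOW TAG KEYS FROM page_hits
    SHOW TAG VALUES FROM page_hits WITH KEY = "browser"
    SELECT sum(hits) FROM page_hits WHERE browser = 'firefox' GROUP BY section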
I've spent the last week working on upgrading our Graphite system. I ultimately killed it and went with InfluxDB. The ease of installation and cluster creation were clear winners.
Additionally, the storage options for Influx trump Graphite's across the board. I tried writing a custom backend and it went nowhere; the docs and code are terrible. I also noticed that Ceres hasn't had a commit in a year - kind of disheartening.
Graphite's rendering functions are its real beauty. I looked at both OpenTSDB and InfluxDB as replacements, and neither have anything close to the power of Graphite's rendering.
I haven't been thrilled with grafana as a dashboard for OpenTSDB, either. I hadn't seen Cyanite.
How do Graphite and InfluxDB compare to, say, an ELK stack?
I'm in the process of setting up a log aggregation/metrics system for PCI logging and general server stats. Right now I'm using Elasticsearch with Logstash and Kibana, and piping OSSEC logs into Elasticsearch for security logging. I like the idea of the ELK stack, but it's not very polished imo.
Graphite is "numbers only." You can't throw logs at it.
Influxdb will be more similar to ElasticSearch, in that it will accept logs and events, and let you query and generate graphs and subsets of the data.
ElasticSearch has more mature clustering/scaling, and last time I checked, much more efficient queries. I don't know if Influx is winning the polish game, but I suspect it depends on your use case.
"If you can't measure it, you can't prove you made it better" is a core value in our organisations' tech culture. We have written an interface in our framework for automatically creating new metrics and it is very easy for a developer to set up a new graph that monitors theirs code.
Now, we have another problem. There are over 2 million metrics in our monitoring system and no one knows what most of them mean. Some graphs have been set up for features that don't exist anymore, other graphs were set up by developers who have already quit, there are lot's of duplicated metrics and in general it is a mess. So we are currently working on this problem. I still would like to mention this is a better problem then not having metrics at all, but still a problem.
If you ever had a similar situation, I will be thankful if you could share your experience on how you solved it.
Obviously, you need metrics to monitor the metrics.
But more seriously, I've attacked this kind of problem in the past by agreeing on a common set of first-look metrics: what is the one-page set of data you start with when trying to identify a problem in a given system/subsystem?
After that, other metrics must exist to solve or monitor a specific issue, or be in reusable shared sub-reports with some stated overview. After that, you can think about automatic deletion after a specific timeframe if they're in sandboxes and not connected to some documented useful purpose.
The only way is to make it self-documenting: when a metric is set up, and when you are navigating the graphs, there should be some text defining where the data came from. It also depends on whether a lack of data means something in your system. In my system, a lack of data for a metric over a time span x implies that something is wrong and should be alerted on. In some systems this might not matter. This creates a self-reinforcing habit of removing dead metrics.
We are working on building a system that combines events and metrics (www.jut.io), and one of the things we've been thinking about is how to store metadata about your metrics. It's not only stuff like min/max values, but almost anything you could think of - developer, feature name, expected frequency, etc. Would a concept like this be useful?
I haven't had the best experience with Graphite. Our main systems practically never crash, but Graphite falls over every few months; seriously, Graphite is less reliable than the systems we use it to monitor. Furthermore, there hasn't been a release in about 2 years, which makes me think the project is dead.
We've had the same problems. Unfortunately we have so many metrics being stored in Graphite that we've come to rely on it for different monitoring tools. This post has some great information on how to scale out Graphite to hopefully make it more stable and efficient for you: https://grey-boundary.io/the-architecture-of-clustering-grap...
I ended up rolling my own replacement. My biggest problem with Graphite was that it managed to grind an expensive large RAID array into the ground with a relatively small number (in my eyes) of metrics. We had the realisation that we'd waste a tremendous amount of hardware or have to cut down drastically on our data collection if we were to roll out Graphite across the board.
(And yes, we had crashes too)
The reason for the disk grinding was simple: the whisper storage system is ridiculously inefficient, as it does tiny writes all over the place, and an excessive number of system calls to boot.
In our case, I decided we don't care if we lose some data when a metric server crashes (if it becomes an issue we'll run two or more VMs on separate hardware and feed half our samples into each), so the first step was to write a simple statsd replacement that shovels 10-second intervals of data into Redis with disk snapshots turned off, coupled with a small daemon that rolls the data up. I've hardcoded the roll-up intervals, as that made it easy to name the keys so that "keys <timestamp for start of each interval to roll up>-<postfix for type of period; e.g. we use 10 seconds, then 5 minutes, then hourly>-*" retrieves the keys of the objects to process at each step.
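A minimal sketch of that key-naming scheme (not the author's actual code; the metric names, intervals and averaging are illustrative):

    import time
    import redis

    r = redis.StrictRedis()

    def record(metric, value):
        now = int(time.time())
        bucket = now - (now % 10)          # start of the 10-second interval
        r.rpush("%d-10s-%s" % (bucket, metric), value)

    def rollup(bucket):
        # fold one finished 10-second bucket into its 5-minute bucket;
        # KEYS is tolerable here because the keyspace is small and short-lived
        for key in r.keys("%d-10s-*" % bucket):
            metric = key.decode().split("-", 2)[2]
            values = [float(v) for v in r.lrange(key, 0, -1)]
            avg = sum(values) / len(values)
            r.rpush("%d-5min-%s" % (bucket - bucket % 300, metric), avg)
            r.delete(key)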
We could easily have beaten Carbon/Graphite on the same system just by doing more efficient disk writes, but since we were going to replace it outright anyway, I figured I might as well keep things in memory.
Then a tiny replacement for the subset of the Graphite HTTP API we used for our graphing (if we'd relied on Graphite itself for our dashboards I'd have thought twice about this...).
Lastly a tiny process that archives a final roll-up of data past 48 hours (currently) to CouchDB for if/when we need to do longer term historical trending.
I keep wanting to talk to our commercial director about letting me release some of this code, though a lot of it is probably too specific to our needs to be all that useful to others (e.g. as mentioned, we only support a tiny subset of the functionality of the Graphite HTTP API, as I've only cared about being able to do the averaging and filtering etc. that we actually use). In general, though, if you don't use Graphite for the actual dashboard, replacing it is surprisingly little work.
I ended up setting up one machine with an 80GB tmpfs mount for the graphite data, and then rsyncing it to disk every hour. That allows carbon-cache to keep up, but I'm not happy with the setup.
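For anyone curious, a setup like that is just a tmpfs mount plus a cron job. A rough sketch (the mount point and paths are made up):

    # /etc/fstab -- large tmpfs for the whisper files
    tmpfs  /var/lib/graphite/whisper  tmpfs  size=80g  0  0

    # crontab -- persist the in-memory data to disk hourly
    0 * * * *  rsync -a /var/lib/graphite/whisper/ /data/whisper-backup/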
If you use collectd to feed values into graphite, you get the advantage of its bulk writes. This article describes how it's solved for RRD-only collectd installations: https://collectd.org/wiki/index.php/Inside_the_RRDtool_plugi... but the effect also becomes visible if you use it to send values to graphite.
It's also good at reducing the amount of data you need to send to graphite.
How do you figure? Unless recent versions of whisper have been totally rewritten, whisper writes each metric to a separate file. Submit hundreds of metrics per VM/server every 10 seconds, and you get ridiculous numbers of tiny writes (e.g. 4-byte writes) fenced by redundant seek()s and a number of other syscalls, no matter how much you batch things up before sending them to statsd.
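To illustrate the pattern (a sketch, not whisper's actual code; the file layout details are simplified):

    import struct

    def write_point(path, offset, timestamp, value):
        # one file per metric, one seek + one small write per datapoint,
        # repeated for every archive/retention level in the file
        with open(path, "r+b") as f:
            f.seek(offset)
            f.write(struct.pack("!Ld", timestamp, value))

    # hundreds of metrics across many hosts every 10 seconds means a storm
    # of tiny, scattered writes that the kernel can't usefully batch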
I came up with the same workaround myself; then switched to influxdb, and my monitoring server's io-wait is still at 0% even with twice as many metrics coming in :)
Graphite is a great tool, but the graphs can be a bit ugly and changing the time range can be annoying. We use Grafana (http://grafana.org/) as a nicer frontend to Graphite.
Being a Windows shop, we had no interest in having a Linux box running Graphite/StatsD, so we went ahead and essentially ported Graphite/StatsD/CollectD to .NET/C#. We'll be open-sourcing this toolset soonish.
I would love for software developers to stop using the names of real things for their software apps. It is so confusing.
It is like they have material world envy or something.
Graphite is already the name of something very common. That makes it hard for people to search for in a search engine, and in headlines like this it confuses the hell out of normal people.
That's kind of how a lot of new products are named, in and out of tech. It can be difficult to create a new word that's agreeable in the way it sounds, and preferably carries a little meaning tied to the product.
The search engine issue is a very real problem, I agree. It was like that in the early days of the Go programming language. Eventually the indexes got better, especially if you gave a little more context to the search (e.g. "http client libraries for go"). And of course, I learned early on that it was going to be a problem, so I started using "golang" in my searches.
It is quite difficult to come up with a good catchy name that has no meaning in the real world, and yet is easy to memorize.
Having said that, I agree with you (and don't understand why people were down-voting, you make a perfectly valid point?). I had real trouble a while back searching how to use "chef knife" and how to write "chef recipes".
Graphite is awesome:
We graph lots and lots of things. We have two datastore servers, and each of them has a static write load of about 60 megs a second (bear in mind that each update is less than 100 bytes); we have many thousands of updates a second.
But why is it awesome? Because it almost eliminates the need for log shipping. 90% of the time we can diagnose most problems with just graphs. Something is running slow? We can see which server it is by looking at load, response time and queue size.
Because we are not doing silly things like parsing logs to gain metrics, we don't need a Hadoop system. We plumb metrics collection directly into cgroups (the primitive that Docker uses) so we can get per-process metrics (disk, memory, cpu, etc.).
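A minimal sketch of what reading per-group stats out of cgroups looks like (the paths assume a standard cgroup v1 layout under /sys/fs/cgroup; the group name is made up):

    GROUP = "docker/abc123"

    def read_int(path):
        with open(path) as f:
            return int(f.read().strip())

    # memory and cumulative cpu usage for one control group
    mem_bytes = read_int("/sys/fs/cgroup/memory/%s/memory.usage_in_bytes" % GROUP)
    cpu_ns = read_int("/sys/fs/cgroup/cpuacct/%s/cpuacct.usage" % GROUP)

    # from here, ship the numbers to graphite as e.g. "containers.abc123.memory"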
The only time we need logs is when we are really stuck, or when diagnosing a specific issue like "why" it went wrong, not what has gone wrong.
Does anyone know of any good solutions for feature extraction from logs, for the purpose of making graphs out of these features?
Has anyone integrated a graphing system like this into Graylog/Logstash and would care to share some lessons or advice on how you did it?
Finally, monitoring suites like Nagios and Zabbix have their own data-collection and graphing features. When tools like Graphite are used in conjunction with these, would you bypass them entirely? Or would you leverage their data-collection features to first grab the data and then somehow funnel it into Graphite?
If it's the former, what do you use instead of all those built-in data collection tools? If it's the latter, how do you do it?
When I used Graphite with Nagios, I bypassed all the Nagios data-collection and graphing features. Instead, I funneled all the data into Graphite and used check-graphite to alert on it.
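The idea behind a check like that is simple. Here's a sketch (not check-graphite itself; the target and thresholds are made up) that pulls recent datapoints from Graphite's render API and exits with a Nagios-style status code:

    import sys
    import requests

    resp = requests.get("http://graphite.example.com/render", params={
        "target": "web.front_page.response_time",
        "from": "-5min",
        "format": "json",
    })
    # the render API returns [{"target": ..., "datapoints": [[value, ts], ...]}]
    points = [v for v, ts in resp.json()[0]["datapoints"] if v is not None]
    latest = points[-1] if points else None

    if latest is None or latest > 500:   # critical threshold (made up)
        sys.exit(2)
    elif latest > 250:                   # warning threshold (made up)
        sys.exit(1)
    sys.exit(0)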
Thanks for this link. Most of the data I deal with is in relational SQL databases, so this looks very promising for my use cases. Maybe it's just me, but most of the graphing libraries and APIs are geared towards JSON data.
It should be mentioned that there is a huge number of implementations of the statsd pattern; depending on one's existing infrastructure, one may prefer one or another. Here's a comprehensive list of them: http://www.joemiller.me/2011/09/21/list-of-statsd-server-imp...
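Part of why so many implementations exist is that the statsd wire format is trivial: one "<name>:<value>|<type>" datagram over UDP. A minimal sketch (the host and metric names are made up, though 8125 is the usual default port):

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"front_page.hits:1|c", ("localhost", 8125))             # counter
    sock.sendto(b"front_page.response_time:320|ms", ("localhost", 8125)) # timer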
I very much like the way in which Metrics 2.0 enhances the duo of collectd and graphite (http://metrics20.org/). See the amazing video of how you can select across the data in the cluster at dieterbe's employer.
Grafana from my POV is the best dashboard at the moment: http://grafana.org/ http://grafana.org/blog/2014/05/25/monitorama-video-and-upda... http://play.grafana.org/
For alerting I'm using Cabot: http://cabotapp.com/
For system metrics, at the moment I'm using Diamond: https://github.com/BrightcoveOS/Diamond
On the other hand, InfluxDB is growing; maybe I'll switch to it next year. Some of its features are better than graphite's, and it's statsd compatible ;-) http://influxdb.com/
Regards ;-)