Looks like another faster horse. A pretty GUI on /proc is not the most burning issue to solve in Linux performance monitoring. I wish anyone making these tools would spend 30 minutes watching my Monitorama talk about instance monitoring requirements at Netflix: http://www.brendangregg.com/blog/2015-06-23/netflix-instance... . I still hate gauges.
Where is the PMC support? At Facebook a few days ago, they said their number one issue was memory bandwidth. Try analyzing that without PMCs. You can't. And that's their number one issue. And it shouldn't be a surprise that you need PMC access to have a decent Linux monitoring/analysis tool. If that's a surprise to you, you're creating a tool without actual performance expertise.
It should front BPF tracing as well... Maybe it will in the future, and I'll check again then.
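To make the PMC point concrete, here is a rough sketch of the kind of access such a tool needs: reading a hardware counter through the perf_event_open(2) syscall. (LLC read misses are used here as a crude stand-in for memory traffic; real memory-bandwidth analysis typically needs platform-specific uncore events, so treat the event choice as an assumption.)

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* Thin wrapper: glibc provides no perf_event_open() symbol. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HW_CACHE;
        attr.size = sizeof(attr);
        /* LLC read misses: a crude proxy for traffic to main memory. */
        attr.config = PERF_COUNT_HW_CACHE_LL |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;

        /* Count for this process, on any CPU. */
        int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0) {
            perror("perf_event_open (PMC access may be restricted)");
            return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... the workload you want to measure goes here ... */
        sleep(1);

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count = 0;
        read(fd, &count, sizeof(count));
        printf("LLC read misses: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }

If perf_event_open fails with EACCES, PMC access is locked down on that host, which is exactly the deployment problem being described.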
First off, I think this is actually designed mainly for embedded applications. It's the only reason I can think of that they'd make a single self-contained monitor that only works for one host.
Second, it's great that you want a car instead of a faster horse, but I want 10 horses. Why?
- 1 horse trained for 1 job easier & more reliable than 1 horse/car trained for 10
- 10 horses more flexible than 1 horse or 1 car
- if a horse isn't doing its job, shoot it and replace it
- don't need to go to horse school to learn how to use or maintain one
- don't need to hire an engineer to teach the horse a new trick
- most problems can be solved with a horse
I get that with some applications, having a couple extra minutes saved by your monitoring system can mean millions of dollars saved. For those cases, there will always be custom-tailored solutions because there is value there. For 99% of the rest of the time, you get more value from simple tools that can be combined to do their jobs well.
If instead of focusing on the transportation we were focusing on the road, we could build a foundation for horses and cars to coexist. I'd love to see more protocols and specifications for each kind of monitor (other than syslog, RRD & SNMP) so they could simply be made according to spec and we could mix and match as we wished.
They are more than welcome to use FreeBSD's PMC support to roll their own implementation. But I predict they won't until, after another decade of beating their heads against broken tools and non-engineered solutions, light through yonder window finally breaks.
Yes, I'm involved, and no, I don't think the current public release of Vector tackles many of the issues yet. But we've been working on them and will have it released when we can. Items include:
- PMCs: Vector's backend is PCP, which has a Linux perf_events PMDA for PMC support. There are no PMC counters by default in Vector, since we're using it in an environment without PMC access, but I've been working heavily in this area. More on that later.
- Flame graph support: we already have it and use it. Ongoing work includes new flame graph types, and a rewritten flame graph implementation in d3. Need to get it all published.
- Heat map support: just solved an issue with them; again, an area where we have ad hoc tools that work and bring value, but haven't wrapped it all in Vector yet.
- BPF support: I've been prototyping many new metrics that we want in Vector, https://github.com/iovisor/bcc#tools. There'll be increased demand to get Vector accessing these in a couple of months or so, when we have newer kernels in production that have BPF.
Don't use the red-green combination in charts, as it makes them really hard to read for those of us with a degree of red-green color blindness (the most common type among the ~5% of men and ~1% of women who have color blindness).
They probably don't know. It's very likely that only a subsample of that ethnicity had good medical data, so conclusions can only be drawn for that subsample. However, it would indicate an area for study, should more reliable sample sizes of other 'near' populations become possible.
I would say, rather: provide an option to change colors for those with various types of color blindness. That way it's accessible, but still intuitive to those who can discern the colors and are familiar with the typical meanings of red and green.
Yes, for those of us who have one to five machines this is awesome, because most other monitoring solutions are really annoying to deploy, since they presume more than five machines :)
Maybe "monitoring" is a poor word choice when there are no alerting/notification capabilities or statics-rollup. But there are different tools out there to do just that.
This seems super useful for zooming in on some performance characteristics of a single server to debug issues!
so from "Introducing-netdata" on their Wiki, there's this excerpt..
With netdata, there is no need to centralize anything for performance monitoring. You view everything directly from their source. Still, with netdata you can build dashboards with charts from any number of servers. And these charts will be connected to each other much like the ones that come from the same server.
So it seems like this is on purpose. And if I understand correctly, I can just build a custom HTML dashboard that connects to multiple machines?
Not to knock this project (I use it and love it), but the learning curve involved here is not trivial. Using it basically requires that you be an accomplished frontend developer, and all that just to display some metrics on a page.
Nine different languages and libraries! Jesus H. Christ!
It's plain to see why outfits like Splunk can get away with charging as much as they do - visualizing metrics in that app is as simple as installing a deb package, logging in, and pointing it at your data source.
Setting up Dashing by hand is comparatively... difficult.
You forgot Clojure. Riemann's monitoring config is written in Clojure.
There's a much more approachable all-JS Riemann clone called Godot that I've successfully used on multiple projects, but it still requires some work to make the frontend look good.
I'm sure most of us can assume reasons why collecting data with node.js might be "wrong," but it would be more helpful to the conversation if you would spell out specific reasons why using node.js for this use case is not optimal, instead of just commenting with a single emoji.
Because it's just not necessary, and meanwhile most experienced SREs and performance engineers are imperative programmers. We don't see the benefits of using JavaScript on our servers, and that's rather putting it mildly.
Also JavaScript doesn't have a 64-bit integer type, which is absolutely necessary to properly support large counters. There are workarounds but the fact that you need one is ridiculous.
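To make the 2^53 point concrete, here is a small sketch in C (since the problem is what happens when a 64-bit counter is forced into a double, which is all a JavaScript number is). The counter value is hypothetical, but a long-lived 64-bit byte counter can plausibly get this large.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* A 64-bit counter just past 2^53, e.g. a long-lived byte counter. */
        uint64_t counter = (1ULL << 53) + 1;

        /* A JavaScript number is an IEEE-754 double, so this is what it sees. */
        double as_double = (double)counter;

        printf("real counter : %llu\n", (unsigned long long)counter);
        printf("as JS number : %.0f\n", as_double); /* rounds to 9007199254740992 */
        return 0;
    }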
Many of us believe that the only reason one would want to use JavaScript is because you're doing browser programming and therefore you have no other choice.
I've seen plenty of server-side applications written in node.js. Just because JavaScript used to be pigeonholed into the front end doesn't mean it can't perform well in server applications.
A good example of a node.js app that is purely server-side is Hubot, a chat bot created at GitHub and widely used with Slack, HipChat, IRC, etc. I'm sure there are thousands of others out there, and I don't believe being written in JavaScript puts them at any fundamental disadvantage compared to server applications written in Python, Java, C++, or any other language.
"If you have a hammer, everything starts to look like your thumb."
JavaScript is reasonably suited to servers written around event loops, like Hubot. But as a general-purpose programming language, I think it's reasonable to argue that it's pretty bad compared to alternatives like Erlang and Go in the spaces it's used in. Also, debugging is a pain in the ass because stack traces aren't useful at all. (That's not limited to JS, but it does make my life harder.)
In any event, we're talking about a telemetry collector, which is generally a trivial piece of code that simply doesn't need whatever benefits JS purports to provide.
It's not about Node per se; it's about JavaScript. System telemetry collectors don't benefit much from languages built around event loops.
Many of us already know C, Bourne shell, Python, and probably another scripting language, and all of those can get telemetry data off systems and into an event bus or metrics aggregator quickly enough. Adding JavaScript adds complexity without giving us significant new functionality.
Now the data collector (gathering metrics from many senders for aggregation and storage) is a different story. I've seen them written in JavaScript but the 64-bit counter difficulties would rule it out for me were I to implement one again.
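As a rough illustration of how little code the sending side takes, here is a sketch in C that reads the 1-minute load average and pushes it to a statsd-style aggregator over UDP. The metric name, address, and wire format are just examples, not anything a particular tool in this thread requires.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Read the 1-minute load average and push it as a statsd-style gauge. */
    int main(void)
    {
        double load1 = 0.0;
        FILE *f = fopen("/proc/loadavg", "r");
        if (!f || fscanf(f, "%lf", &load1) != 1) {
            perror("/proc/loadavg");
            return 1;
        }
        fclose(f);

        char msg[64];
        int len = snprintf(msg, sizeof(msg), "host.load1:%.2f|g", load1);

        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst;
        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(8125);             /* conventional statsd port */
        inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

        sendto(s, msg, (size_t)len, 0, (struct sockaddr *)&dst, sizeof(dst));
        close(s);
        return 0;
    }

Run it from cron or a tight loop and you have a collector; everything interesting happens on the aggregation side.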
Can you explain how the npm incident is related to typing? I am an advocate of strong typing and love seeing it be the answer to all sorts of problems (e.g. concurrency), but I'm not understanding the connection here.
Not the parent, but IMHO requiring node.js is heavy for something as simple as data collection. I wouldn't mind if plugins (sufficiently complex ones) could be written in node.js, but requiring node.js is too much.
I'd prefer something like Lua (and C, of course), or even Python (which is installed on most systems anyway, but still too heavy).
Is there such a thing as 95th-percentile CPU monitoring?
Consider an application that spikes (close to 100% on a core) for 2-3s on some web requests -- let's assume this is normal (nothing can be done about it). Now, let's say the average user of the system is idle for 2 minutes per web request. So users won't see performance degradation unless $(active-users) > $(cores) during a 2-3 minute window.
For most monitoring systems, CPU is reported as an average over a minute, and even if a core is pinned for only 2-3s out of every 60s, that shows up as only ~5% usage. Presume a 2-CPU system with 5 users who all happen to be in a conference call... and hit the system at exactly the same time (but are otherwise mostly idle). The CPU graph might show 10-15% usage (no flag), yet those 5 users will report significant application performance issues (one of them will have to wait 6-9s).
What I'd like to monitor, as a system administrator, is the 95th-percentile utilization of the CPUs -- that is, over the minute, throw away the bottom 94% of samples (mostly idle cycles) and report the CPU utilization at the next highest percentile. This should show me those pesky CPU spikes. Anything that does that?
I don't really understand your question. Fundamentally, a process is either on the CPU or not; there's nothing between 0 and 100% CPU usage. CPU utilization only makes sense with a windowing function, for example: in 90% of samples over the last second, some process was on the CPU.
I think what you want for your application, if I understand any of it correctly, is the time spent by a runnable process waiting to get scheduled on a CPU. That would indicate contention. If a process runs immediately when it becomes runnable, then there's no contention, no matter what the windowed utilization looks like.
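On Linux, one place that shows up directly is /proc/<pid>/schedstat: as I understand the kernel's sched-stats documentation, the second field is cumulative nanoseconds the task has spent runnable but waiting on a run queue (this needs schedstats support in the kernel, so treat the field interpretation as an assumption). A rough sketch:

    #include <stdio.h>
    #include <stdlib.h>

    /* Print how long a task has spent runnable-but-waiting for a CPU.
     * /proc/<pid>/schedstat: <on-cpu ns> <runqueue-wait ns> <timeslices> */
    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }

        char path[64];
        snprintf(path, sizeof(path), "/proc/%s/schedstat", argv[1]);

        FILE *f = fopen(path, "r");
        if (!f) {
            perror(path);
            return 1;
        }

        unsigned long long on_cpu = 0, wait = 0, slices = 0;
        if (fscanf(f, "%llu %llu %llu", &on_cpu, &wait, &slices) != 3) {
            fprintf(stderr, "unexpected format in %s\n", path);
            fclose(f);
            return 1;
        }
        fclose(f);

        printf("on-CPU: %.3f s, waiting for CPU: %.3f s (%llu timeslices)\n",
               on_cpu / 1e9, wait / 1e9, slices);
        return 0;
    }

If the waiting number is growing quickly while the box looks "idle" on one-minute averages, that's your contention.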
You're more interested in response time than CPU usage; the customer is having a bad time whether the slow responses are due to CPU or IO or bad weather.
That's another (important) measure, but not the one I'm interested in. In the very short run, I can add CPUs if there is hardware contention. Response time is an indicator of, but doesn't prove, CPU starvation. Request response time analysis is a longer-term quality-of-service measure for identifying and fixing the application (if that is even feasible).
...
There's a reason why ISPs bill at the 95th percentile -- it's directly correlated with the cost of over-subscribed resources. If you're running a VM cluster, you've got a similar issue, only with CPUs (and memory) rather than bandwidth. Most operational divisions don't have the luxury of fixing a vendor application. Instead, they have limited variables to play with: CPU cores and memory being the coarse levers. While one could argue this is an "application" issue, and you'd be correct, it's irrelevant to my question.
I'm asking what tools are available for system administrators to diagnose and address CPU starvation (under spiked usage). Current tools and techniques I'm aware of don't seem to measure this.
The only metric that matters is the measure of the business task at hand. You might dig into CPU utilization after you've identified a problem with your application, but trying to identify an application problem by measuring CPU is like trying to determine where your shipment is by looking at engine RPM of the freight vehicle.
That being said, historic CPU utilization is a useful metric for capacity planning.
Once you've identified long response times, it would also be possible to observe that the threads handling requests spent a good percentage of time waiting for CPU, which would point to CPU saturation as the problem. I'm not sure how you do this on GNU/Linux, but you can assess this on illumos with ptime(1) or prstat(1M).
I think your concern about measuring CPU utilization is real, though. You can use frequent sampling and present samples on a heat map to deal with this problem. There are some examples here:
http://www.brendangregg.com/HeatMaps/utilization.html
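And for the 95th-percentile idea itself, nothing stops you from rolling it yourself: sample /proc/stat every second, convert the deltas into per-second utilization, and report the 95th percentile alongside the average. A rough sketch, assuming the usual aggregate "cpu" line format (user, nice, system, idle, iowait, ...):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define SAMPLES 60   /* one minute of per-second samples */

    /* Read the aggregate "cpu" line of /proc/stat into total and idle jiffies. */
    static int read_cpu(unsigned long long *total, unsigned long long *idle)
    {
        FILE *f = fopen("/proc/stat", "r");
        if (!f)
            return -1;
        unsigned long long v[8] = { 0 };
        int n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                       &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
        fclose(f);
        if (n < 4)
            return -1;
        *total = 0;
        for (int i = 0; i < 8; i++)
            *total += v[i];
        *idle = v[3] + v[4];   /* idle + iowait */
        return 0;
    }

    static int cmp(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        double util[SAMPLES];
        unsigned long long t0, i0, t1, i1;

        if (read_cpu(&t0, &i0) != 0)
            return 1;
        for (int s = 0; s < SAMPLES; s++) {
            sleep(1);
            if (read_cpu(&t1, &i1) != 0)
                return 1;
            double dt = (double)(t1 - t0), di = (double)(i1 - i0);
            util[s] = dt > 0 ? 100.0 * (dt - di) / dt : 0.0;
            t0 = t1;
            i0 = i1;
        }

        qsort(util, SAMPLES, sizeof(util[0]), cmp);
        double sum = 0;
        for (int s = 0; s < SAMPLES; s++)
            sum += util[s];
        printf("avg %.1f%%  p95 %.1f%%  max %.1f%%\n",
               sum / SAMPLES, util[(int)(0.95 * (SAMPLES - 1))], util[SAMPLES - 1]);
        return 0;
    }

With 5% of a minute pinned, the average reads ~5% while the p95/max values make the spike obvious.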
Aren't you overestimating an evening with strace and tcpdump a little bit as a "quality of service measure"? In my experience the reasons for high request response times are very easy to find, but not necessarily easy to fix. At best it takes 30 minutes to find the root of the problem and 30 minutes to come up with a fix; at worst it takes 30 minutes to find and half a man-year to fix across an entire stack.
I am not aware of such a tool. It is a good idea to report p50, p90, p99, and p99.9 for CPU utilization, but I think it is good to report the average as well. Application performance issues should be monitored from the app's point of view (response latency), and if you have that, you can drill down and pinpoint CPU issues. Generally speaking, it is better to monitor metrics from the upper layers than to just look at OS graphs. Most monitoring systems cover these in one dashboard so you can easily track down issues.
Gave it a try. Definitely not useful for running the daemon and viewing the UI on the same machine: Chrome eats at least 50% of one core just to show the realtime data.
On my RPi B, the daemon eats 4% average on all four cores, with almost all the time spent in the kernel. I assume polling the various entries under /proc/ is costly.
But also weird. The fact that the collectors, the storage, and the UI all run on each box makes this more like a small-scale replacement for top and assorted command-line tools such as iostat than a scalable, distributed monitoring system. The lack of central collection means you cannot get a cluster-wide view of a given metric, nor can you easily build alerting on top of this.
I'm also disappointed that it reimplements a lot of collectors that already exist in mature projects like collectd and Diamond (and, more recently, Influx's Telegraf). I understand that fewer external dependencies can be useful, but still, does every monitoring tool really need to write its own collector for reading CPU usage? You'd think there would be some standardization by now.
For comparison, we use Prometheus + Grafana + a lot of custom metrics collectors. Grafana is less than stellar, though. I'd love to have this UI on top of Prometheus.
Well, I believe performance monitoring should be realtime, and I optimized netdata for this. A console killer! A tool you can actually use instead of the console tools. It is not (yet) a replacement for any other solution.
At the moment (like, literally now, just took a break and saw this), I am configuring graphite + collectd + grafana (and probably cabot on top for alerts), using ansible to set up collectd and sync the configuration across the nodes.
After some time of using graphite + statsd and friends, I came to really appreciate the benefits of using widely adopted open source components and the flexibility it gives over all-in-one solutions such as this. On the other hand, solutions like this are much easier to configure, especially the first time when you are not familiar with the tools yet.
It's great that they've got all that explanatory prose for the metrics. That would help when reviewing data with other team members who aren't familiar with the context of each metric.
I have less of a realtime system review need than a post-mortem need. Today, I'll use kSar to do that, but this tool looks much more capable.
It's too bad that it doesn't provide an init script or other startup feature. The installer, while it doesn't seem to follow typical distribution patterns, is otherwise fairly complete.
I have already decoupled the code for handling the chart comments, but it needs some more work to remove it from the dashboard HTML and put it in separate JSON files.
I think it was more like the OP led with his or her chin. "Done right? Okay..."
It's also true that pretty much every piece of monitoring software I encounter makes me sad when I look under the hood. And the proprietary ones that I've seen under the hood of... hoo boy.
You can see application resource usage in the applications section. This groups the whole process tree and reports CPU, memory, swap, disk, etc. usage per process group.
The grouping is configured via /etc/netdata/apps.conf.
Dashboards can already include charts from multiple servers, but the UI for configuring this is missing. If you build a custom dashboard yourself, it can be done (just a div per chart; no JavaScript needed on your side).
Regarding scripting, please open a github issue to discuss it. I really like the idea.
Impressive. The dashboard is a bit condensed, though; putting all the details on one page is a little overwhelming. Maybe add some tabs (CPU, memory, disk, network, etc.)?
nmon gives you console beauty without external dependencies. You can watch it in console mode and cron schedule it in batch mode for long-term data collection.
There are performance notes in the wiki. As I understand it, CPU is unlikely to be impacted much, but you can increase the history setting away from the default and eat all your RAM.
I'm guessing it doesn't update its own packages (that'd be a little odd), but it looks to be using node.js stuff in the background, so apt-get keeps that up to date.