Looks like another faster horse. A pretty GUI on /proc is not the most burning issue to solve in Linux performance monitoring. I wish anyone making these tools would spend 30 minutes watching my Monitorama talk about instance monitoring requirements at Netflix: http://www.brendangregg.com/blog/2015-06-23/netflix-instance... . I still hate gauges.
Where is the PMC support? At Facebook a few days ago, they said their number one issue was memory bandwidth. Try analyzing that without PMCs. You can't. And that's their number one issue. And it shouldn't be a surprise that you need PMC access to have a decent Linux monitoring/analysis tool. If that's a surprise to you, you're creating a tool without actual performance expertise.
It should front BPF tracing as well... Maybe it will in the future, and I'll check again then.
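To make the PMC point concrete, here is a rough sketch of the kind of access such a tool needs: reading a hardware counter through the perf_event_open(2) syscall. (LLC read misses are used here as a crude stand-in for memory traffic; real memory-bandwidth analysis typically needs platform-specific uncore events, so treat the event choice as an assumption.)

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* Thin wrapper: glibc provides no perf_event_open() symbol. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HW_CACHE;
        attr.size = sizeof(attr);
        /* LLC read misses: a crude proxy for traffic to main memory. */
        attr.config = PERF_COUNT_HW_CACHE_LL |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;

        /* Count for this process, on any CPU. */
        int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0) {
            perror("perf_event_open (PMC access may be restricted)");
            return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... the workload you want to measure goes here ... */
        sleep(1);

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count = 0;
        read(fd, &count, sizeof(count));
        printf("LLC read misses: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }

If perf_event_open fails with EACCES, PMC access is locked down on that host, which is exactly the deployment problem being described.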
First off, I think this is actually designed mainly for embedded applications. It's the only reason I can think of that they'd make a single self-contained monitor that only works for one host.
Second, it's great that you want a car instead of a faster horse, but I want 10 horses. Why?
- 1 horse trained for 1 job easier & more reliable than 1 horse/car trained for 10
- 10 horses more flexible than 1 horse or 1 car
- if a horse isn't doing its job, shoot it and replace it
- don't need to go to horse school to learn how to use or maintain one
- don't need to hire an engineer to teach the horse a new trick
- most problems can be solved with a horse
I get that with some applications, having a couple extra minutes saved by your monitoring system can mean millions of dollars saved. For those cases, there will always be custom-tailored solutions because there is value there. For 99% of the rest of the time, you get more value from simple tools that can be combined to do their jobs well.
If instead of focusing on the transportation we were focusing on the road, we could build a foundation for horses and cars to coexist. I'd love to see more protocols and specifications for each kind of monitor (other than syslog, RRD & SNMP) so they could simply be made according to spec and we could mix and match as we wished.
They are more than welcome to use FreeBSD's PMC support to roll their own implementation. But I predict they won't until, after another decade of beating their heads against broken tools and non-engineered solutions, light through yonder window finally breaks.
Yes, I'm involved, and no, I don't think the current public release of Vector tackles many of the issues yet. But we've been working on them and will have it released when we can. Items include:
- PMCs: Vector's backend is PCP, which has a Linux perf_events PMDA for PMC support. There are no PMC counters by default in Vector, since we're using it in an environment without PMC access, but I've been working heavily in this area. More on that later.
- Flame graph support: we already have it and use it. Ongoing work includes new flame graph types, and a rewritten flame graph implementation in d3. Need to get it all published.
- Heat map support: just solved an issue with them; again, an area where we have ad hoc tools that work and bring value, but haven't wrapped it all in Vector yet.
- BPF support: I've been prototyping many new metrics that we want in Vector, https://github.com/iovisor/bcc#tools. There'll be increased demand to get Vector accessing these in a couple of months or so, when we have newer kernels in production that have BPF.
Don't use the red-green combination in charts, as it makes them really hard to read for those of us with a degree of red-green color blindness (the most common type among the ~5% of men and ~1% of women who have color blindness).
They probably don't know. It's very likely that only a subsample of that ethnicity had good medical data, so conclusions can only be drawn for that subsample. However, it would indicate an area for study, should more reliable sample sizes of other 'near' populations become possible.
I would say, rather: provide an option to change colors for those with various types of color blindness. That way it's accessible, but still intuitive to those who can discern the colors and are familiar with the typical meanings of red and green.
Yes, for those of us who have one to five machines this is awesome, because most other monitoring solutions are really annoying to deploy, since they presume more than five machines :)
Maybe "monitoring" is a poor word choice when there are no alerting/notification capabilities or statics-rollup. But there are different tools out there to do just that.
This seems super useful for zooming in on some performance characteristics of a single server to debug issues!
so from "Introducing-netdata" on their Wiki, there's this excerpt..
With netdata, there is no need to centralize anything for performance monitoring. You view everything directly from their source. Still, with netdata you can build dashboards with charts from any number of servers. And these charts will be connected to each other much like the ones that come from the same server.
So it seems like this is on purpose. And if I understand correctly, I can just build a custom HTML dashboard that connects to multiple machines?
Not to knock this project (I use it and love it), but the learning curve involved here is not trivial. Using it basically requires that you be an accomplished frontend developer, and all that just to display some metrics on a page.
Nine different languages and libraries! Jesus H. Christ!
It's plain to see why outfits like Splunk can get away with charging as much as they do - visualizing metrics in that app is as simple as installing a deb package, logging in, and pointing it at your data source.
Setting up Dashing by hand is comparatively... difficult.
You forgot Clojure. Riemann's monitoring config is written in Clojure.
There's a much more approachable all-JS Riemann clone called Godot that I've successfully used on multiple projects, but it still requires some work to make the frontend look good.
I'm sure most of us can assume reasons why collecting data with node.js might be "wrong," but it would be more helpful to the conversation if you would spell out specific reasons why using node.js for this use case is not optimal, instead of just commenting with a single emoji.
Because it's just not necessary, and meanwhile most experienced SREs and performance engineers are imperative programmers. We don't see the benefits of using JavaScript on our servers, and that's rather putting it mildly.
Also JavaScript doesn't have a 64-bit integer type, which is absolutely necessary to properly support large counters. There are workarounds but the fact that you need one is ridiculous.
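To make the 2^53 point concrete, here is a small sketch in C (since the problem is what happens when a 64-bit counter is forced into a double, which is all a JavaScript number is). The counter value is hypothetical, but a long-lived 64-bit byte counter can plausibly get this large.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* A 64-bit counter just past 2^53, e.g. a long-lived byte counter. */
        uint64_t counter = (1ULL << 53) + 1;

        /* A JavaScript number is an IEEE-754 double, so this is what it sees. */
        double as_double = (double)counter;

        printf("real counter : %llu\n", (unsigned long long)counter);
        printf("as JS number : %.0f\n", as_double); /* rounds to 9007199254740992 */
        return 0;
    }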
Many of us believe that the only reason one would want to use JavaScript is because you're doing browser programming and therefore you have no other choice.
I've seen plenty of server-side applications written in node.js. Just because JavaScript used to be pigeonholed into the front end doesn't mean it can't perform well in server applications.
A good example of a node.js app that is purely server-side is Hubot, a chat bot created at GitHub and widely used with Slack, HipChat, IRC, etc. I'm sure there are thousands of others out there, and I don't believe being written in JavaScript puts them at any fundamental disadvantage compared to server applications written in Python, Java, C++, or any other language.
"If you have a hammer, everything starts to look like your thumb."
JavaScript is reasonably suited to servers written around event loops, like Hubot. But as a general-purpose programming language, I think it's reasonable to argue that it's pretty bad compared to alternatives like Erlang and Go in the spaces it's used in. Also, debugging is a pain in the ass because stack traces aren't useful at all. (That's not limited to JS, but it does make my life harder.)
In any event, we're talking about a telemetry collector, which is generally a trivial piece of code that simply doesn't need whatever benefits JS purports to provide.
It's not about Node per se; it's about JavaScript. System telemetry collectors don't benefit much from languages built around event loops.
Many of us already know C, Bourne shell, Python, and probably another scripting language, and all of those can get telemetry data off systems and into an event bus or metrics aggregator quickly enough. Adding JavaScript adds complexity without giving us significant new functionality.
Now the data collector (gathering metrics from many senders for aggregation and storage) is a different story. I've seen them written in JavaScript but the 64-bit counter difficulties would rule it out for me were I to implement one again.
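As a rough illustration of how little code the sending side takes, here is a sketch in C that reads the 1-minute load average and pushes it to a statsd-style aggregator over UDP. The metric name, address, and wire format are just examples, not anything a particular tool in this thread requires.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Read the 1-minute load average and push it as a statsd-style gauge. */
    int main(void)
    {
        double load1 = 0.0;
        FILE *f = fopen("/proc/loadavg", "r");
        if (!f || fscanf(f, "%lf", &load1) != 1) {
            perror("/proc/loadavg");
            return 1;
        }
        fclose(f);

        char msg[64];
        int len = snprintf(msg, sizeof(msg), "host.load1:%.2f|g", load1);

        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst;
        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(8125);             /* conventional statsd port */
        inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

        sendto(s, msg, (size_t)len, 0, (struct sockaddr *)&dst, sizeof(dst));
        close(s);
        return 0;
    }

Run it from cron or a tight loop and you have a collector; everything interesting happens on the aggregation side.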
Can you explain how the npm incident is related to typing? I am an advocate of strong typing and love seeing it be the answer to all sorts of problems (e.g. concurrency), but I'm not understanding the connection here.
Not the parent, but IMHO requiring node.js is heavy for something as simple as data collection. I wouldn't mind if plugins (sufficiently complex ones) could be written in node.js, but requiring node.js is too much.
I'd prefer something like Lua (and C, of course), or even Python (which is installed on most systems anyway, but still too heavy).
Is there such a thing as 95th-percentile CPU monitoring?
Consider an application that spikes (close to 100% on a core) for 2-3s on some web requests -- let's assume this is normal (nothing can be done about it). Now, let's say the average user of the system is idle for 2 minutes per web request. So users won't see performance degradation unless $(active-users) > $(cores) during a 2-3 minute window.
For most monitoring systems, CPU is reported as an average over a minute, and even if a core is pinned for only 2-3s out of every 60s, that shows up as only ~5% usage. Presume a 2-CPU system with 5 users who all happen to be in a conference call... and hit the system at exactly the same time (but are otherwise mostly idle). The CPU graph might show 10-15% usage (no flag), yet those 5 users will report significant application performance issues (one of them will have to wait 6-9s).
What I'd like to monitor, as a system administrator, is the 95th-percentile utilization of the CPUs -- that is, over the minute, throw away the bottom 94% of samples (mostly idle cycles) and report the CPU utilization at the next highest percentile. This should show me those pesky CPU spikes. Anything that does that?
I don't really understand your question. Fundamentally, a process is either on the CPU or not; there's nothing between 0 and 100% CPU usage. CPU utilization only makes sense with a windowing function, for example: in 90% of samples over the last second, some process was on the CPU.
I think what you want for your application, if I understand any of it correctly, is the time spent by a runnable process waiting to get scheduled on a CPU. That would indicate contention. If a process runs immediately when it becomes runnable, then there's no contention, no matter what the windowed utilization looks like.
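On Linux, one place that shows up directly is /proc/<pid>/schedstat: as I understand the kernel's sched-stats documentation, the second field is cumulative nanoseconds the task has spent runnable but waiting on a run queue (this needs schedstats support in the kernel, so treat the field interpretation as an assumption). A rough sketch:

    #include <stdio.h>
    #include <stdlib.h>

    /* Print how long a task has spent runnable-but-waiting for a CPU.
     * /proc/<pid>/schedstat: <on-cpu ns> <runqueue-wait ns> <timeslices> */
    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }

        char path[64];
        snprintf(path, sizeof(path), "/proc/%s/schedstat", argv[1]);

        FILE *f = fopen(path, "r");
        if (!f) {
            perror(path);
            return 1;
        }

        unsigned long long on_cpu = 0, wait = 0, slices = 0;
        if (fscanf(f, "%llu %llu %llu", &on_cpu, &wait, &slices) != 3) {
            fprintf(stderr, "unexpected format in %s\n", path);
            fclose(f);
            return 1;
        }
        fclose(f);

        printf("on-CPU: %.3f s, waiting for CPU: %.3f s (%llu timeslices)\n",
               on_cpu / 1e9, wait / 1e9, slices);
        return 0;
    }

If the waiting number is growing quickly while the box looks "idle" on one-minute averages, that's your contention.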
You're more interested in response time than CPU usage; the customer is having a bad time whether the slow responses are due to CPU or IO or bad weather.
That's another (important) measure, but not the one I'm interested in. In the very short run, I can add CPUs if there is hardware contention. Response time is an indicator of, but doesn't prove, CPU starvation. Request response time analysis is a longer-term quality-of-service measure for identifying and fixing the application (if that is even feasible).
...
There's a reason why ISPs bill at the 95th percentile -- it's directly correlated with the cost of over-subscribed resources. If you're running a VM cluster, you've got a similar issue, only with CPUs (and memory) rather than bandwidth. Most operational divisions don't have the luxury of fixing a vendor application. Instead, they have limited variables to play with: CPU cores and memory being the coarse levers. While one could argue this is an "application" issue, and you'd be correct, it's irrelevant to my question.
I'm asking what tools are available for system administrators to diagnose and address CPU starvation (under spiked usage). Current tools and techniques I'm aware of don't seem to measure this.
The only metric that matters is the measure of the business task at hand. You might dig into CPU utilization after you've identified a problem with your application, but trying to identify an application problem by measuring CPU is like trying to determine where your shipment is by looking at engine RPM of the freight vehicle.
That being said, historic CPU utilization is a useful metric for capacity planning.
Once you've identified long response times, it would also be possible to observe that the threads handling requests spent a good percentage of time waiting for CPU, which would point to CPU saturation as the problem. I'm not sure how you do this on GNU/Linux, but you can assess this on illumos with ptime(1) or prstat(1M).
I think your concern about measuring CPU utilization is real, though. You can use frequent sampling and present samples on a heat map to deal with this problem. There are some examples here:
http://www.brendangregg.com/HeatMaps/utilization.html
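And for the 95th-percentile idea itself, nothing stops you from rolling it yourself: sample /proc/stat every second, convert the deltas into per-second utilization, and report the 95th percentile alongside the average. A rough sketch, assuming the usual aggregate "cpu" line format (user, nice, system, idle, iowait, ...):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define SAMPLES 60   /* one minute of per-second samples */

    /* Read the aggregate "cpu" line of /proc/stat into total and idle jiffies. */
    static int read_cpu(unsigned long long *total, unsigned long long *idle)
    {
        FILE *f = fopen("/proc/stat", "r");
        if (!f)
            return -1;
        unsigned long long v[8] = { 0 };
        int n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                       &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
        fclose(f);
        if (n < 4)
            return -1;
        *total = 0;
        for (int i = 0; i < 8; i++)
            *total += v[i];
        *idle = v[3] + v[4];   /* idle + iowait */
        return 0;
    }

    static int cmp(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        double util[SAMPLES];
        unsigned long long t0, i0, t1, i1;

        if (read_cpu(&t0, &i0) != 0)
            return 1;
        for (int s = 0; s < SAMPLES; s++) {
            sleep(1);
            if (read_cpu(&t1, &i1) != 0)
                return 1;
            double dt = (double)(t1 - t0), di = (double)(i1 - i0);
            util[s] = dt > 0 ? 100.0 * (dt - di) / dt : 0.0;
            t0 = t1;
            i0 = i1;
        }

        qsort(util, SAMPLES, sizeof(util[0]), cmp);
        double sum = 0;
        for (int s = 0; s < SAMPLES; s++)
            sum += util[s];
        printf("avg %.1f%%  p95 %.1f%%  max %.1f%%\n",
               sum / SAMPLES, util[(int)(0.95 * (SAMPLES - 1))], util[SAMPLES - 1]);
        return 0;
    }

With 5% of a minute pinned, the average reads ~5% while the p95/max values make the spike obvious.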
Aren't you overestimating an evening with strace and tcpdump a little bit as a "quality of service measure"? In my experience the reasons for high request response times are very easy to find, but not necessarily easy to fix. At best it takes 30 minutes to find the root of the problem and 30 minutes to come up with a fix; at worst it takes 30 minutes to find and half a man-year to fix across an entire stack.
I am not aware of such a tool. It is a good idea to report p50, p90, p99, and p99.9 for CPU utilization, but I think it is good to report the average as well. Application performance issues should be monitored from the app's point of view (response latency), and if you have that, you can drill down and pinpoint CPU issues. Generally speaking, it is better to monitor metrics from the upper layers than to just look at OS graphs. Most monitoring systems cover these in one dashboard so you can easily track down issues.
Gave it a try. Definitely not useful for running the daemon and viewing the UI on the same machine: Chrome eats at least 50% of one core just to show the realtime data.
On my RPi B, the daemon eats 4% average on all four cores, with almost all the time spent in the kernel. I assume polling the various entries under /proc/ is costly.
But also weird. The fact that the collectors, the storage, and the UI all run on each box makes this more like a small-scale replacement for top and assorted command-line tools such as iostat than a scalable, distributed monitoring system. The lack of central collection means you cannot get a cluster-wide view of a given metric, nor can you easily build alerting on top of this.
I'm also disappointed that it reimplements a lot of collectors that already exist in mature projects like collectd and Diamond (and, more recently, Influx's Telegraf). I understand that fewer external dependencies can be useful, but still, does every monitoring tool really need to write its own collector for reading CPU usage? You'd think there would be some standardization by now.
For comparison, we use Prometheus + Grafana + a lot of custom metrics collectors. Grafana is less than stellar, though. I'd love to have this UI on top of Prometheus.
Well, I believe performance monitoring should be realtime, and I optimized netdata for this. A console killer! A tool you can actually use instead of the console tools. It is not (yet) a replacement for any other solution.
At the moment (like, literally now, just took a break and saw this), I am configuring graphite + collectd + grafana (and probably cabot on top for alerts), using ansible to set up collectd and sync the configuration across the nodes.
After some time of using graphite + statsd and friends, I came to really appreciate the benefits of using widely adopted open source components and the flexibility it gives over all-in-one solutions such as this. On the other hand, solutions like this are much easier to configure, especially the first time when you are not familiar with the tools yet.
It's great that they've got all that explanatory prose for the metrics. That would help when reviewing data with other team members who aren't familiar with the context of each metric.
I have less of a realtime system review need than a post-mortem need. Today, I'll use kSar to do that, but this tool looks much more capable.
It's too bad that it doesn't provide an init script or other startup feature. The installer, while it doesn't seem to follow typical distribution patterns, is otherwise fairly complete.
I have already decoupled the code for handling the chart comments, but it needs some more work to remove it from the dashboard HTML and put it in separate JSON files.
I think it was more like the OP led with his or her chin. "Done right? Okay..."
It's also true that pretty much every piece of monitoring software I encounter makes me sad when I look under the hood. And the proprietary ones that I've seen under the hood of... hoo boy.
You can see application resource usage in the applications section. This groups the whole process tree and reports CPU, memory, swap, disk, etc. usage per process group.
The grouping is configured via /etc/netdata/apps.conf.
Dashboards can already include charts from multiple servers, but the UI for configuring this is missing. If you build a custom dashboard yourself, it can be done (just a div per chart; no JavaScript needed on your side).
Regarding scripting, please open a github issue to discuss it. I really like the idea.
Impressive. The dashboard is a bit condensed, though; putting all the details on one page is a little overwhelming. Maybe add some tabs (CPU, memory, disk, network, etc.)?
nmon gives you console beauty without external dependencies. You can watch it in console mode and cron schedule it in batch mode for long-term data collection.
There are performance notes in the wiki. As I understand it, CPU is unlikely to be impacted much, but you can increase the history setting away from the default and eat all your RAM.
I'm guessing it doesn't update its own packages (that'd be a little odd), but it looks to be using node.js stuff in the background, so apt-get keeps that up to date.