All these graphs are never really actionable and only of interest for a short period of time; you stop looking at them after a while because they don't mean anything unless you already know where and when the problem is.
A server admin wants an "Incident" panel that shows only the anomalous components at the top, coupled with an adjustable alerting mechanism, and not just a blind dump of all the data there is.
There are so many tools that do this and pretend it's impressive, including ELK, but whether it's Grafana or Kibana, you need a lot of manual tweaking to make the dashboards actually useful.
This is one of the things I am focusing on most - how to package and then surface "anomaly events" to the user, so that the user can quickly digest them and decide whether or not they represent an "incident". So human-in-the-loop sort of ML, to assist and lower the cognitive load of all the charts.
That will give two summary aggregation-type charts based on anomaly scores built off all the system.* charts by default (but it can be configured however you want) - so, at each second, an anomaly probability for each chart, and an anomaly flag if the model thinks that chart looked anomalous at that time.
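To make that concrete, here is a minimal sketch of the idea in Python - not Netdata's actual model, just a rolling z-score toy to show what "a probability and a flag per chart per second" means:

    import numpy as np

    def anomaly_scores(values, window=300, threshold=0.99):
        # Toy per-second anomaly scoring for a single chart: NOT the real
        # model, just an illustration of "probability + flag per second".
        values = np.asarray(values, dtype=float)
        probs = np.zeros(len(values))
        for t in range(window, len(values)):
            hist = values[t - window:t]               # trailing training window
            mu, sigma = hist.mean(), hist.std() + 1e-9
            z = abs(values[t] - mu) / sigma           # how unusual is this second?
            probs[t] = 1.0 - np.exp(-z)               # squash into [0, 1)
        flags = probs >= threshold                    # "this chart looked anomalous"
        return probs, flags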
We are also working on some other related projects to build this capability out more and do it at the edge, in C++ or Go as opposed to Python (or via a parent node), as cheaply as possible, so there is minimal, ideally negligible, impact on the agent itself. We should have some more features related to this in the coming months, as we are just trying to dogfood them internally a little first.
What's the argument for anomaly detection? It's an obvious thing to do that has been tried many times, but it doesn't actually seem to provide much value in practice (especially at large scale, where you'll get spurious correlations).
What would you need it for? Once you have defined your SLOs, either your service meets them or it doesn't. What's the value in alerting someone that "this graph looks funny"?
I generally agree that well-defined SLOs work for back-end services: you define service contracts between services and care less about the particular funny graph being surfaced, and more about whether a particular service is out of contract.
Where automatic anomaly detection was very valuable for us was in the video domain, with multi-dimensional end-user telemetry. I.e. what would be lost as noise in top-level metrics could be surfaced via anomaly detection for specific combinations of dimensions that you could not otherwise manually observe. E.g. video start time in Mexico is fine... but an ISP in Mexico City, while not failing outright, is newly underperforming once the data is sliced and the anomaly highlighted, and we need to feed this into our CDN switching to improve video start time there.
The data had too many dimensions that were always changing, with degraded experience easily lost in the noise when measuring across platforms and our software updates, combinations of target devices, connection types, geo location, specific content, active A/B tests, etc. In such cases automatic anomaly detection was pretty critical.
I almost think of anomaly detection as a UI/UX type tool to help users navigate the data/systems. So use ML to find "interesting" or "novel" periods of time in your architecture (in the sense that the ML thinks they look novel based on some model), and then enable a user who is ultimately best placed to decide if it's actually of real interest to them or more like a false positive that they can just ignore and move on.
So doing it in a way where you can quickly scan such events could, I think, be useful, even if only 1 in 20 actually turn out to be a potential problem that might have been missed by your alarms, or maybe even a precursor to some impact on SLOs etc.
The aim would also be "this collection of graphs looks funny at the same time" as opposed to "this individual graph looks funny", because if you have an anomaly score for every chart, then at any given moment some individual charts will be randomly firing. But when you pool the information across charts and hosts and systems, the hope is that you can use anomaly detection as another way to explore your system and catch when things change unexpectedly.
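As a rough sketch of the pooling idea (my own toy illustration, not how any product actually implements it): look at the fraction of charts flagged anomalous at each second, and only surface windows where many charts fire together for a while.

    import numpy as np

    def pooled_anomaly_rate(flags_by_chart):
        # flags_by_chart: {"system.cpu": bool array, "system.ram": bool array, ...}
        # Returns, per second, the fraction of charts flagged anomalous.
        # One noisy chart barely moves this; many charts firing together does.
        stacked = np.vstack(list(flags_by_chart.values()))
        return stacked.mean(axis=0)

    def interesting_windows(rate, min_rate=0.2, min_len=10):
        # Yield (start, end) second ranges where at least min_rate of charts
        # are anomalous for at least min_len consecutive seconds.
        hot = rate >= min_rate
        start = None
        for i, h in enumerate(hot):
            if h and start is None:
                start = i
            elif not h and start is not None:
                if i - start >= min_len:
                    yield (start, i)
                start = None
        if start is not None and len(hot) - start >= min_len:
            yield (start, len(hot))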
It's about troubleshooting. When you have a complex infrastructure, it's not enough to say that your db queries are slower than usual. Ok, so you immediately see that your db server is getting a lot more traffic. What was the root cause though and what can you do about it now? Given enough "funny charts", you can see for example that you have hit a resource limit that you can temporarily raise and also see that a particular component of your infrastructure has an anomalous behavior, e.g. a cron job that was usually utilizing resources for a few seconds, now takes minutes. So you can provide a quick workaround and move on to investigate what changed with that cron job.
Do 4xx responses count against your SLO? For me they don't, but an abnormal increase might still signify that something is actually wrong. (I haven't yet found a useful tool for highlighting this kind of abnormality though)
Since you're creating an interface for Python algorithms, please make the interface public, well documented and easily extensible, and allow it to have a different algorithm per meter.
I completely disagree. I work on a system with multiple servers communicating with each other and billions of events per day. Watching the meters wiggle on all the servers simultaneously is a really important debugging tool. If something slows down or goes down, and you know what it's connected to, it's pretty easy to troubleshoot what the cause is at an incredibly specific level just by looking at the meters.
I think it's likely one's take on whether these tools are useful is very dependent on the system architecture.
Thank you for this feedback. I am the founder of Netdata.
Netdata is about making our lives easier. If you need to tweak Netdata, please open a github issue to let us know. It is a bug. Netdata should provide the best possible dashboards and alerts out of the box. If it does not for you, we missed something and we need your help to fix it, so please open a github issue to let us know of your use case. We want Netdata to be installed and effectively used with zero configuration, even mid-crisis, so although tweaking is possible and we support plenty of it, it should not be required.
An "incident" is a way to organize people, an issue management tool for monitoring, a collaboration feature. Netdata's primary goal however, is about exploring and understanding our infrastructure. We are trying to be amazingly effective in this by providing unlimited high resolution metrics, real-time dashboards and battle tested alarms. In our roadmap we have many features that we believe will change the way we understand monitoring. We are changing even the most fundamental features of a chart.
Of course, at the same time we are trying to improve collaboration. This is why Netdata.Cloud, our free-forever SaaS offering that complements the open-source agent to provide out-of-the-box infrastructure-level monitoring alongside several convenience features, organizes our infra in war-rooms. In these war-rooms we have added metrics correlation tools that can help us find the most relevant metrics for something that got our attention: an alarm, a spike or a dive on a chart.
For Netdata, the high-level incident panel you are looking for will be based on a mix of charts and alarms. And we hope it is also going to be fully automated, auto-detected and provided with zero configuration and tweaking. Stay tuned. We are baking it...
The same way GitHub, Slack or Cloudflare provide massively free-forever SaaS offerings while making money.
We believe that the world will greatly benefit from a monitoring solution that is massively battle tested, highly opinionated, incorporating all the knowledge and experience of the community for monitoring infrastructure, systems and applications. A solution that is installed in seconds, even mid-crisis, and is immediately effective in identifying performance and stability issues.
The tricky part is to find a way to support this and sustain it indefinitely. We believe we nailed it!
So, we plan to avoid selling monitoring features. Our free offering will never have a limit on the number of nodes monitored, the number of users using it, the number of metrics collected, analyzed and presented, the granularity of data, the number of war-rooms, of dashboards, the number of alarms configured, the notifications sent, etc. All these will always be free.
And no, we are not collecting any data for ML or any other purpose. The opposite actually: we plan to release ML at the edge, so that each server will learn its own behavior.
We plan to eventually sell increased convenience features, enforcement of compliance to business policies and enterprise specific integrations, all of them on top of the free offering.
I was analyzing the activity in the netdata project, and what I found interesting was that this project is less active than I would have thought. See the following for insights into the project:
In the last 30 days, there were 2 frequent and 3 occasional contributors. I honestly thought frequent contributors would have been much higher, which leads me to believe the project is quite mature and they don't need a lot of people to work on netdata.
Based on Crunchbase, they've raised about 33 million so far, and if the number of people required to maintain netdata is low (relatively speaking that is), I can see them not really needing to worry about making money, and I'm guessing they are finding value in gathering data for ML.
> if the number of people required to maintain netdata is low (relatively speaking that is)
The Netdata agent is a robust and mature product. We maintain it and we constantly improve it, but:
- most of our efforts go to Netdata.Cloud
- most of the action in the agent is in internal forks we have. For example, we are currently testing ML at the edge. This will eventually go into the agent, but it is not there yet. Same with eBPF. We do a lot of work to streamline the process of providing the best eBPF experience out there.
> I can see them not really needing to worry about making money
We are going to make money on top of the free tier of Netdata.Cloud. We are currently building the free tier. In about a year from now we will start introducing new paid features to Netdata.Cloud. Whatever we have released by then will always be free.
> I'm guessing they are finding value in gathering data for ML
No, we are not gathering any data for any purpose. Our database is distributed. Your data are your data. We don't need them.
P.S. I am the only person working on ML at Netdata and I can confirm we don't gather any data for ML purposes, which is actually my biggest challenge right now :) - convincing people the ML can be useful without having lots of nicely labeled data from real Netdata users to quantify that with typical metrics like accuracy etc. I'm hoping to introduce mainly unsupervised ML features into the product that don't rely on lots of labeled data, plus thumbs up/down type feedback, and we can then use that to figure out if new ML-based features are working or being useful for users. So any models that are trained will be trained on the host and live on the host, as opposed to in Netdata Cloud somewhere.
> I am the only person working on ML at Netdata and I can confirm we don't gather any data for ML purposes, which is actually my biggest challenge right now :)
Yeah, I would have to imagine that would be an issue. This is just my personal opinion, but I think there should be a way to provide anonymized data for building anomaly detection models - maybe an opt-in feature, as it would benefit everybody using netdata.
There's value in having large dashboards that contain a bunch of non-prioritised graphs and gauges. I've managed to find a fair few problems by scrolling through such dashboards. Usually it's due to poorly configured monitors/alerts, but sometimes I'll spot things that you wouldn't reasonably expect an algorithm to pick up.
Plus it's good fun to look at a big dashboard and pretend you're Homer Simpson at the Springfield Nuclear Power Plant.
Based on my experience a dashboard that you only use when you know where and when the problem occurred is incredibly useful, and the lack of one can be very frustrating. While you of course need a systematic approach to incident detection, you also need comprehensive eyes-on-glass dashboards during your investigations. "Anomaly detection" is much spoken of but generalized anomaly detection doesn't exist. You still need skilled operators to just have a look around in many cases.
An example, drawn from several major incidents in my career. You get an alert, you narrow it down to a process or machine, you evict the machine from your serving population to remediate the incident, but how do you keep it from recurring? The anomalous thing isn't apparent in your monitoring data, so it must be among the bazillion statistics that a running system exposes, but which you can't afford to collect and monitor on a per-host, per-container, per-process level of detail. That's when you want something exactly like netdata!
> never really actionable ... only of interest for a short period ... you know where and when the problem is.
I'm not a sysadmin of a large shop (I did that for a short bit, but prior to this existing), so I can only speak as a guy who runs a few big linux servers/virtuals. I've had netdata installed on my home servers for quite some time. And yes, the graphs were really cool, at first, and kinda went into the background.
Here's the thing: when something isn't right with those boxes, that's become the first place I visit. Since I had some franken-boxes with a bunch of storage, it's often related to the array, or btrfs. When I hop in there I'll notice an alert or two, google it, alter something and never see it again. It helped me solve some network issues.
I don't know; short of it being a busier process than I'd like on my server (only a little, and I'm running a few plug-ins on the one that I'm unhappy with), it's been helpful.
Same. I'm not a fully qualified sysadmin, but I do have access to a number of our servers (I'm more of a full-stack generalist than an expert at anything), and I immediately go to netdata when one of my services isn't acting right. For me it's a nice 'system at a glance' where I can check on the host and then alert someone more knowledgeable than myself if there's something that looks off.
I think you took a lot of flack in the original comment[0]
The alarms in netdata resolved a long-standing network issue on one of my boxes, and have variously alerted me to problems I could resolve with storage which greatly improved performance on my largest volume. On my other box, one look at the graphs alerted me to the fact that the entire SSD for my bcache volume was going unused[1]. I then used them while altering configuration and working with the drive to ensure the cache was being filled in a manner consistent with what the volume stored/how it was used.
The more I think about it, I might not have been as enthusiastic in my original comment as I should have been. It's been very helpful to me. I don't usually keep things like this running for very long (it wastes cycles on aging hardware... that isn't heavily used, but hey, it's the principle!) but I've kept this around because every time I've thought about removing it, I've visited the dashboard one last time and found something there that made me keep it.
[0] Though, as I mentioned, I'm not a sysadmin; I have a lab that might indicate otherwise, but I don't get paid for it.
[1] I had reloaded the machine/redone a previous configuration that included bcache and it screamed; I knew my new setup was much slower but I had forgotten about it until netdata made it obvious, again. I can't remember what I had to do to fix it, but it had something to do with the policy used to determine if a file should be put into the cache, and I think it was related to the fact that the cache was added to a volume with data present that rarely changed.
I do not necessarily disagree with you regarding what a server admin / ops personnel needs, however;
I for one deeply enjoy interacting with my Netdata dashboard whenever I want to deep dive into my server's resources and behaviors. For me it fits a purpose, and if I ever were to run a company that hosted things, I would want it and I would want to pay for it.
I am a huge fan and a long time homelab user of Netdata.
> A server admin wants an "Incident" panel that shows only the anomalous components at the top, coupled with an adjustable alerting mechanism, and not just a blind dump of all the data there is.
Netdata does this too, with a ton of thresholds already set up by default. The list of active alerts is at the top, with a badge and everything. Notifications use a hook system, so you can use whatever mechanism you like. Personally I get emails for medium level alerts, SMS for high and above, and wall posts/notifications on my primary machine for crits. It took some tuning to get the thresholds right for me, all perfectly easy to do.
I agree I would prefer to have the active warnings more visible than the graphs, but one click away really isn't bad.
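For reference, the threshold tuning I mentioned is just a small health conf file - roughly like the following, written from memory, so double-check the health documentation for exact field names and paths:

    # /etc/netdata/health.d/cpu_custom.conf  (hypothetical example)
    alarm: cpu_sustained_high
        on: system.cpu
    lookup: average -5m unaligned of user,system
    every: 1m
    warn: $this > 80
    crit: $this > 95
    to: sysadmin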
I agree with this, and it's interesting how many open source tools there are that create these graphs and charts, store tons of data, etc. All with a mostly "eyes on glass" bent, which doesn't scale terribly well.
When, really, what's more important is actionable events, correlation, duplicate suppression, escalating notifications, etc. Something like what "Netcool Omnibus" and other commercial software does. Isolate actionable problems and make sure somebody owns the problem.
But for reasons I don't understand, there isn't much in the open source world in that space.
I run Netdata on my home server using the official docker image. I mostly use it to detect run-away containers and monitor system temperature. For these use-cases it works great, and I like that it's self-contained; way less headache than stringing together Graphviz stuff, or setting up Nagios or Prometheus.
Beware - by default it'll send home telemetry, and the web UI will try to "register" your instance with some kind of cloud. I find this super annoying, but it's possible to turn it off; just not well documented.
There are also a lot of plugins that scrape many kinds of logs, look at process data, etc. Again, these might be useful, but for a home user it's much better to turn it all off.
Neat project, but I'm also confused by 'distributed'; it sounds like it's designed for monitoring multiple systems in a single dashboard OOB with 'zero-config', but on further digging it seems like the distributed monitoring works 'only' with their cloud service[1].
It's OOTB configured to use their free cloud service, but with 2 lines of config you can run your own central collection point instead. That's what I do for my home install.
BUT the UI for this is just a dropdown for each of your monitored servers. I've found I actually want to export data to a more robust system so I can view patterns across machines, too.
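For reference, the config is roughly the following (from memory, so double-check the stream.conf docs): point the child's [stream] section at the parent and enable the same API key on the parent. Hostnames and the key below are placeholders.

    # child node: /etc/netdata/stream.conf
    [stream]
        enabled = yes
        destination = parent.example.lan:19999
        api key = 11111111-2222-3333-4444-555555555555

    # parent node: /etc/netdata/stream.conf
    [11111111-2222-3333-4444-555555555555]
        enabled = yes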
Thanks. So monitoring multiple machines is possible in a central console, although not in a single dashboard. Hopefully it will be available soon; the project seems useful as it is, and I like the idea of getting the system information of all the servers and SBCs in my network.
This is what we do with Netdata Cloud. We want to keep the FOSS agent as a powerful single node monitoring tool and use the cloud for free infrastructure monitoring. You can see ktsaou's comment above on how we intend to monetize.
That is true. It has some features that allow you to quickly jump between many machines, but indeed, out of the box, it is not for many servers at the same time.
BUT it can be configured to push data to Prometheus or similar (it's called a "backend"), and some other integrations, like notifications, can be set up.
Super neat project, very easy to set up. I highly recommend it to anyone who does performance troubleshooting. Netdata installed on a standard Linux system will detect a lot of different things (firewalls, containers, and a lot of software like databases, queuing systems and mail systems) and provide additional data every second.
We complement the Netdata agent with Netdata.Cloud, a free-forever SaaS offering that maintains all the principles of the Netdata agent, while providing infrastructure level monitoring and several additional convenience features.
In Netdata.Cloud, infrastructure is organized in war-rooms. In each war-room you will find the "Overview" page, which provides a fully automated dashboard, very similar to the one provided by the agent, in which every chart presented aggregates data from all servers in the war-room! Magic! Zero configuration! Fully automated!
Keep in mind that Netdata.Cloud is a thin convenience layer on top of the Netdata agent. We don't aggregate your data. Your data stay inside your servers. We only collect and store some metadata (how many netdata agents you have, which metrics they collect, what alarms have been configured, when they triggered - but not the actual metric and log data of your systems).
Awesome! I see the free tier is indeed looking generous. Just hooked up a node and looks good - I like the calculate correlations on alerts thing in particular.
>Keep in mind that Netdata.Cloud is a thin convenience layer on top of the Netdata agent.
I see. Didn't know/understand that.
On the claim node page - could you perhaps add the kickstart bash script code too? I find myself needing them one after the other yet they're on different pages
At the moment it's based on a short window of data, so the focus is more on short-term changes around an area of interest you have already found.
Longer term, it would be cool to be able to use an anomaly score on the metrics themselves (or the fact that a lot of alarms happen to be going off) to automatically find such regions for you, so it's more like surfacing insights to you, as opposed to you having to already know a window of time you are interested in.
>Keep in mind that Netdata.Cloud is a thin convenience layer on top of the Netdata agent. We don't aggregate your data.
I didn't get that from the website until just now. I was looking and looking for how much it would cost to subscribe for our 150 dev/stg/prod VMs -- usually that's the killer.
Indeed, very handy tool. I once used it to discover that a new deployment generated CPU load spikes. The reason was a badly implemented piece of JavaScript doing a call to the db when hovering over a product (to preview stock). Fun to see the actual correlation in a GUI in real time.
There's a netdata prometheus exporter, but it overlaps a lot with node-exporter. If you're already running netdata, however, then it could be a good choice.
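If I recall correctly, the integration is essentially Prometheus scraping the agent's own allmetrics endpoint; a scrape job looks roughly like this (hostname is a placeholder, and the exact options are worth checking against the docs):

    scrape_configs:
      - job_name: 'netdata'
        metrics_path: '/api/v1/allmetrics'
        params:
          format: [prometheus]
        static_configs:
          - targets: ['myhost.example.lan:19999']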
node_exporter is a lot more robust. Had both running for a while and netdata would get stuck when there's I/O trouble, while node_exporter was carefully built not to do any I/O and kept working just fine.
Note that netdata phones home without consent in the default configuration. For many, the whole point of doing system administration is self-hosting and autonomy, and privacy is frequently a big component of that.
Netdata blows a big hole in that by transmitting your usage information off of your box without getting permission.
we use the data we gather in order to make smarter product decisions. We want to invest resources where it matters, so we need to know how our users use the product.
I hear you, we know that our audience is sensitive to their privacy. We all are.
Here are a couple of thoughts that have guided us. Thank you for engaging in this conversation and for caring enough.
1) This data is crucial for us. We need as much as we can get, and it's highly specialized to Netdata (e.g. a sudden increase in crashes will prompt our team to look at recent changes).
2) The more friction we add (opt-in), the less people will do it (because people choose the easier route, always) and thus we will have less data to work with.
3) People who care enough, as you said, about their privacy can *very* easily disable the anonymous statistics, by either adding a flag to the install script or doing a small config change afterwards (rough commands after this list). I feel that we are communicating in many different places that we collect anonymous data, so most of our users should be informed.
4) It's a fairly standard industry tactic and I don't believe that other solutions are not doing it. Of course this is not an excuse for anything, just noting that we are not an outlier.
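To make point 3 concrete, the opt-out routes are roughly the following (paths from memory, so please check the docs):

    # at install time
    bash kickstart.sh --disable-telemetry

    # or afterwards: drop the opt-out marker file and restart the agent
    sudo touch /etc/netdata/.opt-out-from-anonymous-statistics
    sudo systemctl restart netdata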
Thanks again for engaging. Feedback is great for us, it makes us both happy (because someone cares enough) and better.
The telemetry isn't anonymous: it includes the client IP; the method you use to transmit the data cannot work anonymously.
Additionally, what's actually unfair is that you proceed with this spying without the consent of the user. Being upfront about it is not obtaining consent: it's just informing the user you're about to violate their (lack of) consent.
You must obtain consent from the user first, before transmitting their information. Otherwise, your software is spyware. (Disclosing that you're going to spy on the user doesn't make you not-spyware.)
> we use the data we gather in order to make smarter product decisions.
Yes, you transmit the private data of the user for the express purpose of enriching yourself.
Opt-out is unethical: you must obtain opt-in consent first. The data you are transmitting does not belong to you.
We actually mask the ip address (https://github.com/netdata/dashboard/blob/master/src/domains...) so it's not even sent - we just send "127.0.0.1" as the IP into our self hosted PostHog. Likewise with any URL, referrer type event properties that could leak a hostname to us - we don't want that data at all so explicitly mask it before even capturing it in our telemetry system.
Previously, when using a fairly standard Google Analytics implementation, we could not really have this level of control all that easily.
So the hope is that with PostHog we can do better here, while still enabling some really useful product telemetry to help us understand how to make the product better over time and to catch bugs and issues quicker too.
Oh, and we have removed Google Tag Manager (GTM) from the agent dashboard, so that's no longer around as a possibility for loading other third-party tags either.
Your claim is false; the IP address cannot be "masked" the way you describe. The spy telemetry transmits the IP as the L3 source on each and every packet.
I literally shared a link to the part of the code that shows we don't capture and record the ip address in our telemetry. You are being quite disingenuous calling things "spyware".
I do appreciate the opt-in vs opt-out argument, and I think on balance, if opt-out helps us make this free product better over time and help our users, then it's worth it - so long as there is a clear route for people to opt out, which is crucially important.
But this is indeed more like an opinion that individuals might differ on in terms of the pros and cons.
I personally love sending telemetry, especially to help make the products I love better :) It feels like I'm giving something back. But that is just my own opinion.
You choose to willfully install Netdata. You have to read the docs where the opt-out telemetry is clearly explained, before you can self-host it too. If you care, you can disable it.
I honestly don’t understand HN. Multiple commenters deriding a free open-source project for having basic telemetry to understand feature usage.
I feel like you are willfully misunderstanding: netdata transmitting the data without consent is unethical: it's not their data to send.
I did not choose to willfully install Netdata - I don't use it because it is unethical spyware.
Telling someone "if you stay where you are, I am going to do $THING_REQUIRING_CONSENT to you in 20 minutes" is not obtaining consent if the person doesn't, say, leave the building. Being in the hospital is not a blanket consent to anything the doctor wants to do, for example.
To transmit a user's private data (their usage) to the app vendor is unethical unless the user has specifically indicated that they want that to happen. If they haven't (and simply installing the software is not that), transmitting it anyway is, at best insanely rude, and at worst actively malicious (like, for example, how Netlify's CLI used to transmit "I opt out of telemetry" events, before I got them to stop).
Calling nonconsensual spyware "basic telemetry" is a euphemism.
>> Being in the hospital is not a blanket consent to anything the doctor wants to do
No, in this case, you willfully signed up for a surgery and decided to skip reading the T&C.
>> user has specifically indicated that they want that to happen
You did this by installing the software without opting-out.
You sound entitled and spoiled. And by incessantly accusing netdata of being spyware, it feels like you are not willing to have a constructive discussion.
I have played around with netdata just yesterday on my home server. Great tool, but the defaults are overkill for my needs. After spending an hour trying to simplify (=disable most of the "collectors") using the documentation, I finally gave up.
Settled on neofetch [1] instead: pure bash, wrote my own custom inputs including color coding for incident reporting in less time than it took me to strip down netdata. Highly recommended if you want to spend your time on other things than (setting up) server monitoring.
Thanks for the link: neofetch seems a good tool when you just want to manually see what is going on. Netdata is also designed to alert, forward data to other locations, monitor at 1 second granularity, and to store historical data efficiently if you want to see what went on in the recent past.
It's because it was built with high granularity and unlimited metrics as a key differentiator from the beginning. The core is written in pure C, optimized to death. Even long-term retention was initially sacrificed, in order to be able to achieve that high performance with minimal resource needs.
Long term retention is now possible, but with relatively high memory requirements, depending on how many metrics are collected. Again, it was a decision to never give up realtime granularity and speed, even at the cost of writing our own timeseries db in C and utilizing more memory.
Prometheus is designed around metric centralization and running a scraper at some interval (every 15s). Netdata was originally focused on running on a single node and collecting at small intervals. Centralizing that data every second is a separate task, and you could avoid it with Netdata simply by viewing Netdata on the node in question. Netdata can also be configured to stream data to a central node.
The centralized pull architecture of Prometheus does not lend itself towards small interval updates or towards resiliency (you actually need to run 2 Prometheus and double scrape for that).
It's a locally installed agent that monitors and serves metrics on the same host. If you want to monitor multiple hosts then you can either visit the dashboards individually, or scrape the APIs and put the metrics on a combined dashboard - which is what Netdata Cloud is.
We have a central influxdb with telegraf metrics among others, and some grafana graphs.
I still install netdata on every machine though. I almost never use it, but there have been some times where it was useful to look at netdata. It's lightweight enough that it hasn't been a problem.
The only gripe I have with it is the approach to security, i.e. the lack of user accounts (even one). So you have to either block the stats by IP (who is doing it these days?) or use other workarounds like proxying by Nginx etc.
I have it listen on a loopback interface and do SSH port forwarding when I want to look at the stats. Nginx proxying with basic auth is a perfectly reasonable approach and not a workaround in my humble opinion. I would trust these two approaches more than an unknown mechanism in Netdata.
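For anyone who wants that route, a minimal nginx setup looks roughly like this (hostname and file paths are placeholders; the htpasswd file is created with the htpasswd tool):

    server {
        listen 80;
        server_name netdata.example.lan;   # placeholder hostname

        location / {
            auth_basic           "Netdata";
            auth_basic_user_file /etc/nginx/netdata.htpasswd;
            proxy_pass http://127.0.0.1:19999;   # local Netdata agent
            proxy_set_header Host $host;
        }
    }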
Using Netdata Cloud is a great way to not spend any time on that and access the Agent's dashboard through the cloud. We use WSS and MQTT, so it's super secure and lightweight.
The data are streamed from the Agent directly to your browser via the cloud.
> So the only convenient way to have security is to use the cloud version? Got it.
I wouldn't formulate it that way; it's just a bit annoying for me to see this trend of not having even a tiny bit of security built in, and having to do extra work just to protect the dashboard. Just one admin account with a randomly generated password would be fine.
That's the key difference between self-hosted and SaaS. If you self-host, you are responsible for setting up the required infrastructure, taking care of updates, backups etc.
If setting up a reverse proxy in front of whatever monitoring you've got is too much, then yes, by all means use the SaaS offering -- but that's 100% the user's responsibility, and there's no need to be snarky about it.
> If you self-host, you are responsible for setting up the required infrastructure, taking care of updates, backups etc.
Are you speaking about Netdata or in general? Because if the former, then at least the updates part is not true: the installation script turns on nightly updates (and telemetry).
Frankly, the reason there is no basic auth is that Netdata doesn't use a third-party web server but a built-in one, so they would have to add this functionality.
It's not that it's too difficult, but we were accustomed to having this functionality built in in similar products in the past; then things changed. When ELK first showed up there was a big wave of attacks on ELK servers, because they were completely unsecured and at that time X-Pack Security was a paid add-on. They changed their mind later, some time after an open-source alternative appeared.
Proxying by nginx is not a workaround; it's the industry-standard way of managing access to services. It's both more convenient and more secure, a rare combination.
More convenient because you can use your company's pre-existing authentication to authenticate the requests, and more secure because you're not having to manage separate passwords and user accounts.
I understand your opinion, but it's not like that everywhere. I work for many clients who have single servers or specific setups and having to configure Nginx is an extra step and an additional layer that could be made totally unnecessary by building in just one admin account and assigning a random password to it.
You can use Netdata Cloud to have secure authenticated access to your single node dashboard. Data remain on your systems and are streamed to your browser. Netdata Cloud stores only metadata.
We have a whole bunch of metrics that we keep track of, and we are currently implementing a load more.
Soonish, we will greatly increase the number of metrics that we gather with eBPF. That, coupled with our per-second granularity, should give you a very detailed view of the system.
It is great, I have claimed my VMs running Netdata to the Netdata Cloud and I am very happy with it! Took me only a few minutes to claim them all (11 VMs) and boom the dashboards were ready out of the box.
I write and maintain an open source monitoring tool and I looked into adding a mode to output metrics in Netdata format and ran away screaming. It's just an unstructured text format where you output commands to stdout, one per line. Each command consists of whitespace-separated fields. Which field is the units? Oh, the 4th. And some fields are optional, I'm not even sure how that works but I think you can't skip an optional field if you then want to use any field after that. It's like structured data formats like JSON or god forbid XML never happened.
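For context, the format being described looks roughly like this (reconstructed from memory of the external plugin docs, so the optional fields may be slightly off):

    CHART mytool.requests '' 'Requests handled' 'requests/s' mytool '' line 1000 1
    DIMENSION ok '' incremental 1 1
    DIMENSION failed '' incremental 1 1
    BEGIN mytool.requests
    SET ok = 1234
    SET failed = 5
    END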