Hey everyone, this is a post I've been working on the past few months about sett...

bombcar · on June 7, 2023

What does the monitoring actually do for you? I've seen these setups, even setup one for myself a few times (either Grafana or similar such as Netdata, or Linode's Longview) but I've not really seen what it does for me beyond the "your disk is almost full" warnings.

Cyph0n · on June 7, 2023

I recently setup basic monitoring using Telegraf + Influx + Grafana. Here are the alert triggers, in order of importance (imo):

* ZFS pool errors. Motivator: one of my HDDs failed and it took me a few days to notice. The pool (raidz1) kept chugging along of course.

* HDD and SSD SMART errors

* High HDD and SSD temperatures

* ZFS pool utilization

* High CPU temperature. Motivator: one of my case fans failed and it took a while for me to notice.

* High GPU temperatures. Motivator: I have two GPUs in my tower, one of which I don't really monitor (used for transcoding).

* High (sustained) CPU usage. I track this at the server level, rather than for individual VMs.

bamfly · on June 7, 2023

Setting an email address you actually check in /root/.forward would provide most of this, and all of it with the addition of low-tens of lines of shell script and a cron job or two, no? I get that tastes vary, but adding more services to worry about & keep updated to my home server(s) is not my idea of a good time. I doubt the custom pieces required to get all of those alerts via email would take longer than installing and configuring that stack, and then the maintenance is likely to be zero for so long that you'll probably replace the hardware before it needs to be touched again (... and if you scripted your setup, it'll very likely Just Work on the replacement)

Cyph0n · on June 7, 2023

Oh definitely, but only if you are not interested in the visualization side.

I wanted the ability to quickly see the current & historical state of these and other metrics, not just configure alerts.

I’m also omitting the fact that I have collectors running inside different VMs on the same host. For example, I have Telegraf running on Windows to collect GPU stats.

bamfly · on June 7, 2023

Ah, yeah, that probably won't be enough for you then. Need Windows monitoring, and want the graphs—yeah, much bigger pain to get anything like that working via email.

aseipp · on June 7, 2023

Continuous performance monitoring of a service, from its inception. I'm building a storage service using SeaweedFS and also a web UI for another project. One thing I'm looking at doing is using k6[1] in order to do performance stress testing of API endpoints and web frontends on a continuous basis under various conditions.[2] For example, I'm trying to lean hard into using R2/S3 for storage offload, so my question is: "What does it look like when Seaweed offloads a local volume chunk to S3 aggressively, and what is the impact of that in a 90/10 hot/cold split on objects?" Maybe 90/10 storage splits are too aggressive or optimistic to hit a specific number. Every so often -- maybe every day at certain points, or a bigger global test once a week -- you run k6 against all these endpoints, record the results, and shuffle them into Prometheus so you can see if things get noticeably worse for the user. Test login flows under bad conditions, when objects they request are really cold or large paginations occur, etc.

You can run numbers manually but I think designing for it up front is really important to keep performance targets on lock. That's where Prometheus and Grafana come in. And I think looking at performance numbers is a really good way to help understand systems dynamics and helps you ask why something is hitting some threshold. On the other hand, there are so many tools and they're often fun to play with, it's easy to get carried away. There's also a pretty reasonable amount of complexity involved in setting it up, so it's also easy to just say fuck it a lot of times and respond to issues on demand instead.

[1] http://k6.io/, it's also a Grafana project.

[2] It can test both normal REST endpoints but also browsers thanks to the use of headless chrome/chromium! So you can actually look at first paint latency and things like that too.