> It's easy to get overwhelmed by all the moving pieces
Exactly my thoughts! Isn't there something (open source and as good as Prometheus+Grafana) that doesn't have as many moving parts as the stack used by OP? I can imagine there are many use cases for that: from side projects (homelabs) to small startups that don't have huge distributed systems, but still need monitoring (without relying on third parties).
Ideally, my setup would be:
- install an agent on each server I'm interested in gathering metrics from. In this regard, Prometheus works just fine
- one service to handle logs/metrics/traces ingestion and that allows you to search and visualize your stuff in nice dashboards. Grafana works, but it doesn't support logs and traces out of the box (you need Loki for logs, and something like Tempo for traces)
So, basically 2 pieces of software (if they can be installed by just dropping a binary, even better)
I think there's nothing currently that combines both logging and metrics into one easy package and visualizes it, but it's also something I would love to have.
Vector[1] would work as the agent, being able to collect both logs and metrics. But the issue would then be storing it. I'm assuming the Elastic Stack might now be able to do both, but it's just too heavy to deal with in a small setup.
A couple of months ago I took a brief look at that when setting up logging for my own homelab (https://pv.wtf/posts/logging-and-the-homelab), mostly looking at memory usage so it would fit on my Synology. Quickwit[2] and Log-Store[3] both come with built-in web interfaces that reduce the need for Grafana, but neither of them does metrics.
Side note: it should be possible to tweak some config parameters to optimize the memory usage or cpu usage of quickwit. Ask us on the discord server next time :)
Yeah, I was a little bit surprised it was so close. And I've been using tantivy (the library which powers quickwit afaik) in another side project where it used comparatively less.
Might jump in there soon as an excuse to fiddle a bit more with the homelab, then :)
I use Telegraf (collector) + Influx (storage) + Grafana (visualization and alerting). Telegraf is amazingly simple to use for collection and has a ton of plugins available.
I also started with that stack, but swapped out InfluxDB for Postgres + TimescaleDB extension, which adds timeseries workflows (transparent partitioning, compression, data retention, continuous aggregates, …).
I found InfluxDB to be lacking in terms of permissions management, flexibility regarding queries (SQL, joins), data retention, ability to debug problems. In Postgres, for example, I can look into the execution plan of a statement, log long running queries, and so on.
Telegraf as an agent is very flexible; it has input plugins for every task I could want, and besides its default "pull workflow" (checks on a defined interval) I also like to push new metrics directly to the Telegraf inputs.socket_listener plugin from my scripts (backup stats, …).
How do you get data from Telegraf into Postgres/TimescaleDB?
I was interested in swapping out InfluxDB, but it turned out to be somewhat difficult to send data from Telegraf to Postgres. It's not as simple as making an HTTP POST, like you can do with InfluxDB.
+1 for Telegraf (with Prometheus and Grafana), rolled out a monitoring stack for our internal network in something like 2 days when a colleague had been manually checking `top` for years each morning. Huge benefit.
My simple-as-dirt solution is generally to use InfluxDB + Grafana. InfluxDB provides a nice HTTP interface that all of my devices simply POST to. I write all the queries myself, because I find that it's a heck of a lot easier than tracking down individual agents/plugins that actually work.
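A rough sketch of what "devices simply POST" can look like, assuming an InfluxDB 1.x server and its `/write` endpoint (2.x uses a different path and token auth). The base URL, the database name `homelab`, the sample data point, and the helper names `build_write_request` and `write_point` are assumptions for illustration only.

```python
import urllib.parse
import urllib.request


def build_write_request(base_url: str, db: str, line: str) -> urllib.request.Request:
    """Build a POST to the InfluxDB 1.x /write endpoint with a line-protocol body."""
    query = urllib.parse.urlencode({"db": db, "precision": "s"})
    url = f"{base_url.rstrip('/')}/write?{query}"
    return urllib.request.Request(url, data=line.encode("utf-8"), method="POST")


def write_point(base_url: str, db: str, line: str, timeout: float = 5.0) -> int:
    """Send one record and return the HTTP status code of the response."""
    req = build_write_request(base_url, db, line)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical reading from a sensor; the timestamp is in seconds (precision=s).
    req = build_write_request(
        "http://localhost:8086",
        "homelab",
        "temperature,room=office value=21.5 1700000000",
    )
    print(req.get_method(), req.full_url)
```

Calling `write_point(...)` with the same arguments actually sends the record. Anything that can make an HTTP request (a shell script with curl, a microcontroller) can do the same thing, which is why no agent is needed on the device side.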
The closest thing may well be Elasticsearch (and Kibana for visualisation), if you are fine with the Elastic license. As its document format is very flexible, it can be used to store logs, metrics, and traces.
It'll be a solution inferior to specialised tools like Prometheus, Mimir, Tempo, though. And some may be put off by the difficulty of running Elasticsearch.
Alternatives could be other general purpose databases.
Telegraf is a single agent that collects a nice amount of metrics and sends them to many databases. I prefer to use Telegraf (and scripts) to collect the metrics into InfluxDB and then Grafana.
Telegraf has some log parsing/extraction functionality, but for something more generic Promtail+Loki would be better.
I've had great success using this helm chart to install the entire stack into my EKS clusters. Even if you're not using Kubernetes, it's still a useful example for how everything should fit together.
https://github.com/prometheus-community/helm-charts/tree/mai...
Ditto. I've recently completed a migration from Thanos to Mimir, and I've found that it's much easier to operate and administer. Still stuck on Elastic for logs, but I'm slowly convincing developers Loki can be just as effective.
...and also for one of my side projects, OSRBeyond.
It's easy to get overwhelmed by all the moving pieces, but it's also a lot of _fun_ to set up.