I’ve been testing the whole stack on a local server, finding kinks, documenting workflows, because I hope to move us to this soon. Now with Pyroscope I would love to try it out even more. I kid you not, we were just testing Datadog Continuous Profiling for our legacy Ruby application not two days ago and it was quite lackluster.
Not saying this will be better (have yet to try), but I’d prefer to support this and report feedback on it for a brighter future for our community.
Ryan here from Pyroscope -- Awesome to hear you're interested in migrating over. Would love for you to try out our Ruby integration and let us know what you think: https://pyroscope.io/docs/ruby/.
Hey! It was alright, but at least for Ruby it didn’t report on the memory allocation of the calls, which is what we were looking for in the first place. We do not use it anywhere else, we only tried it for this pesky legacy system.
It's very much our aim to make this mix of self-hosted and cloud services as easy as going all-cloud; but I agree we're not quite there yet.
Do you mind if I ask what isn't super-easy about linking self-hosted Loki search queries with SaaS Prometheus? You should be able to, e.g., add a Prometheus data source to your local Grafana (or securely expose your Loki to the internet and add a Loki data source to your Cloud Grafana).
Honestly I haven't tried that much, but didn't find anything in the docs so I assumed it wasn't a prioritized area.
In our particular scenario, we'd probably want to run Loki + Grafana locally, and then hosted Prometheus + hosted Grafana for metrics.
But would be great if we could just tell the two about each other, and under which domains they exist. That way, Prometheus-grafana could construct URLs that linked straight into Loki-grafana (that we host) for e.g. the same interval, or the same label filter (GET params).
But it would only work if I (the end-user) had access to both. That way, we don't have to expose Loki to the internet. But linking would still work.
There are quite a lot of services that do this with GitHub and commits. You can link from e.g. Bugsnag to GitHub by only telling Bugsnag your org and repo names. But Bugsnag won't have read access to GitHub (they also have another integration method which does require access, but that's not the one I'm talking about here).
Those types of "linking into a known URL pattern of another service" integrations are easy to set up and very easy to secure.
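To make the "known URL pattern" idea concrete, here is a minimal sketch of what I mean. The base URL, the `/logs` path and the `query`/`from`/`to` parameter names are all hypothetical placeholders, not Grafana's actual URL scheme; the point is only that the linking side needs a domain and a pattern, not credentials.

```python
from urllib.parse import urlencode


def build_logs_link(base_url: str, label_filter: str, start_ms: int, end_ms: int) -> str:
    """Build a deep link into a self-hosted log UI.

    Only the URL is constructed here; nothing is fetched, so the service
    generating the link never needs network access to the log backend.
    The path and parameter names are assumptions for illustration.
    """
    params = {"query": label_filter, "from": start_ms, "to": end_ms}
    return f"{base_url.rstrip('/')}/logs?{urlencode(params)}"


# Hypothetical internal hostname; only users inside the network can open it.
link = build_logs_link("https://grafana.internal.example/", '{app="api"}', 1000, 2000)
print(link)
```

The link only resolves for someone who can already reach the internal host, which is why this style of integration is easy to secure.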
We are definitely investigating columnar formats in Tempo to store traces. We expect it to drastically accelerate search as well as open up more complex querying and eventually metrics from distributed tracing data.
However, we are currently primarily targeting Parquet as our columnar format in object storage.
It doesn't have to mean the end for Cortex, but others will have to step up to lead the project. We've tried to put other maintainers in place to kick start this.
(Tom here; I started the Cortex project on which Mimir is based and lead the team behind Mimir)
Thanos is an awesome piece of software, and the Thanos team have done a great job building a vibrant community. I'm a big fan - so much so we used Thanos' storage in Cortex.
Mimir builds on this and makes it even more scalable and performant (with a sharded compactor and query engine). Mimir is multitenant from day 1, whereas this is a relatively new thing in Thanos I believe. Mimir has a slightly different deployment model to Thanos, but honestly even this is converging.
Generally: choosing Thanos is always going to be a good choice, but IMO choosing Mimir is an even better one :-p
Okay, but why? I am using Thanos today. It works, it's complex, when it breaks, it's a bit of a challenge to fix, but it happens. It doesn't break often.
It does the job. Mimir is based on Cortex; using either one, what benefit am I getting?
I get asked every few months about moving off of Thanos to Cortex, and now Mimir, and I don't have any substantial reason to do so. It feels like moving for the sake of moving.
I need to see some real reasoning as to how moving everything to Mimir is going to add value.
Sounds like Thanos is working well for you, so in your position I wouldn't change anything.
There are a bunch of other reasons why people might choose Mimir; perhaps they have outgrown some of the scalability limits, or perhaps they want faster high cardinality queries, or a different take on multi-tenancy.
Do remember Cortex (on which Mimir is based) predates Thanos as a project; Thanos was started to pursue a different architecture and storage concept. Thanos storage was clearly the way forward, so we adopted it. The architectures are still different: Thanos is "edge"-style IMO, Mimir is more centralised. Some people have a preference for one over the other.
That's fair, thanks for the input. The only reason we implemented Thanos in the first place was a particular feature that we needed at the time of implementation. Now using it in an extremely large environment, I haven't seen any scalability limits. Speed of queries isn't a driver of anything.
Multi-tenancy certainly is, but we built our own custom multi-tenancy solution on top of it. I'd like to get rid of that ultimately, but we're not utilizing whatever multi-tenant features exist at the moment. Perhaps that will be a driver.
We were struggling with Cortex a couple years ago, then we tried VictoriaMetrics and haven't looked back. It goes pretty much unattended with just monitoring disk space to make sure we still have room to continue pouring in metrics.
When a component crashes (not often) it recovers pretty much without us noticing.
Multi-tenancy is something that shouldn't be underestimated. A lot of people think it's just a checklist item until (a) they need it or (b) they try to implement it in an existing system. Kudos for making it a day-one feature.
While I agree with your point in the general case, would you mind elaborating on the specific case of Prometheus?
My understanding is that the recommended best-practice for Prometheus is to deploy as many of them as necessary, as close to the monitored infrastructure as possible.
What use case would require deploying a single Mimir (and so, presumably, a single Prometheus cluster) to serve multiple tenants? Why not just deploy a dedicated Prometheus / Mimir stack per client?
I don't know Prometheus, but I would imagine the answer depends on just how many clients you have. Probably doesn't matter if you're talking just a few. If it's a lot, then separate instances can be very expensive in terms of operational complexity and waste due to resource fragmentation. Multi-tenancy is good for bringing both of those back under control. Is there something about Prometheus that would negate that?
For one, it doesn't really support authentication (although it's on the roadmap).
I'm no Prometheus expert, but since you're pretty much expected to be running a bunch of servers anyway, the operational complexity has to be handled even for just one client.
You do have a point on resource fragmentation, but IME Prometheus' resource usage is fairly predictable, so you could probably mitigate that to a point.
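The resource-fragmentation argument above can be illustrated with a rough back-of-the-envelope sketch. All numbers here are made up for illustration (tenant loads in arbitrary units, one node holding 1000 units), and it assumes every dedicated stack needs at least one whole node; real sizing depends on far more than this.

```python
import math


def nodes_dedicated(tenant_loads: list[int], node_capacity: int) -> int:
    """One stack per tenant: each tenant's load rounds up to whole nodes."""
    return sum(max(1, math.ceil(load / node_capacity)) for load in tenant_loads)


def nodes_shared(tenant_loads: list[int], node_capacity: int) -> int:
    """One multi-tenant pool: only the combined load rounds up."""
    return max(1, math.ceil(sum(tenant_loads) / node_capacity))


# Hypothetical tenant loads, not measurements from any real deployment.
loads = [200, 300, 100, 1400, 500]
print(nodes_dedicated(loads, 1000), nodes_shared(loads, 1000))
```

With only a handful of tenants the gap is small; it grows with the number of small tenants, which matches the "depends on how many clients you have" point.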
Would it be possible to have a split offering, with both on prem and cloud? In my mind I would prefer to have things like Prometheus, logs, and metrics stored on prem, mainly due to the volume of logs and metrics we create. Then use Grafana Cloud for Grafana dashboards, Loki logs, and incident management that pull directly from my on prem data stores. I bring this up as it may be cost prohibitive for us to store our metrics in the cloud (we make so many metrics and logs!) but I would love to offload hosting the front end. Grafana Cloud takes care of managing and maintaining the Grafana dashboards and backend database, authentication, updates, etc. I'm fine hosting Prometheus and Loki locally, have been for a long time! I just get annoyed having to host Grafana and set it up, set the database up, configure auth, etc.
I'm curious about this part, and I can absolutely understand if you don't want to answer but I do have the following question:
Why is it tricky to ensure an application can run on a cloud deployed system or a local Kubernetes/Docker Swarm/newfangled containerization mechanism of choice/etc. system?
Specifically I'm wondering what barriers you're running into that are pushing the focus to go cloud only.
Grafana Labs is hiring! Come work on Grafana, Prometheus, Cortex, Loki, Tempo and more - lots of open source, with both a SaaS and an enterprise team. We use Golang, JS, TypeScript, Kubernetes, Jsonnet, Tanka, CUE and more.
We're growing fast, have lots of happy customers and many exciting projects in the pipeline. We need good engineers to help us take Grafana and Prometheus to the masses.
Not sure why this is being downvoted... Hiring people to fix things is a lot more reasonable (and a good idea!) than asking for free labour when you have $50 million in the bank.
Mind elaborating? We built Loki for some pretty massive scale but I've always tried to make it work at super small scale too. What went wrong?