3 years ago I started a company (Stackify) with the hopes of building a better nagios. But it doesn't seem like developers really wanted it. They wanted and needed a lot more, as basic server metrics didn't really tell much of a story about application health. The shift to cloud based services also makes a lot of basic monitoring tools unnecessary. Cloud based apps don't really need to monitor servers or infrastructure beyond simple CPU and memory measurement. Developers need to monitor the app itself, which can really only be done by code profiling, custom metrics, and analyzing errors and log statements.
So we have since pivoted a little bit and focused heavily on true application monitoring via basic server metrics, custom app metrics, error tracking, log management, and true APM code profiling. All of this together provides a lot of power when it comes to monitoring and finding application problems.
A lot of companies we talk to barely monitor anything about their apps. So many IT teams work in such a reactive mode they aren't very proactive when it comes to monitoring application health and behavior.
Would love to get anyone's feedback about this topic. Do you just use basic server monitoring? In how much detail do you monitor the actual behavior and health of your application? How do you do it?
At Stack Exchange we use Bosun to monitor both system and application metrics. As an application developer I get pretty much everything I need on the system side out of the box with scollector. We spend a fair amount of time on projects now developing application-level metrics, which we use BosunReporter [0] to send to the same instance. It's quite useful to have application and system metrics on the same tool.
It really needs to be minimal-effort to get started. I use newrelic and takipi where I can, and the most crucial thing is that they don't need any ahead-of-time configuration - I can slap them on in a generic configuration and be confident that they'll alert appropriately (newrelic) or collect the right information about failures (takipi).
I don't wanna derail the comments here, since as cbaleanu mentioned it's pretty off topic, so I'll try to stay brief here. If you would like some more input, my email is in my profile.
> But it doesn't seem like developers really wanted it.
I don't see why (most) developers would want a monitoring system. Why not target sysadmins / devops? In my experience developers really just want a trace of their app when it breaks, or maybe some performance profiling.
> The shift to cloud based services also makes a lot of basic monitoring tools unnecessary.
Even though on HN it might seem like everyone is using "the cloud" for everything, the reality is quite different. There's a pretty sizable market for server monitoring. Even in the case of cloud services, someone is managing the metal.
> code profiling, custom metrics, and analyzing errors and log statements.
newrelic provides a lot of these, although I'm personally not a fan (it seems like their daemon is the main thing causing resource alerts), and has a free tier.
>So many IT teams ... aren't very proactive when it comes to monitoring application health and behavior.
Well, if you're talking to IT teams, they may not be able to do anything proactive regarding health / behavior of applications. In many cases the team making sure the service stays up has no control over the code or inner workings of the application. They're limited to configuring and changing supporting servers (database, lb, web server), not the application itself, and the application state is treated as a binary "is it up or down", with "turn it off and back on again" being a common "fix".
Looking at the product, it does look quite nice, but collectd, icinga, nagios, zabbix (and now bosun, which I'll likely be switching to) are free and open source server monitoring tools, and newrelic, with all its in-depth monitoring, has a free tier. Assuming that virtual servers (xen) are each considered a separate server, I definitely can't see paying $15/mo per server/instance. I'd be paying an order of magnitude more than the colo for my servers costs me. However, with 5 physical servers, $75/mo is a lot more reasonable than the (at minimum) $600 I'd be paying if billed per instance. At work, our 4000+ servers would cost $60000 a month, more than enough that we could hire developers to write a custom system. With so many free and/or open options, it'd really have to blow me away to be worth it. Additionally, a free trial for something like monitoring is a hard sell. The time and effort invested into the setup of a new monitoring system is definitely not free.
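The per-server arithmetic above can be checked with a quick sketch. The $15 price and the server counts come from the comment; the function name is made up for illustration, and the 40-instance figure is simply $600 divided by $15, not something stated above.

```python
# Price quoted in the comment: $15 per server/instance per month.
PRICE_PER_INSTANCE_USD = 15


def monthly_cost(instances: int, price: int = PRICE_PER_INSTANCE_USD) -> int:
    """Return the monthly bill in USD for a number of billed instances."""
    return instances * price


# 5 physical servers, billed per physical machine.
print(monthly_cost(5))       # 75
# 4000 servers at work.
print(monthly_cost(4000))    # 60000
# How many billed instances a $600/mo bill corresponds to.
print(600 // PRICE_PER_INSTANCE_USD)  # 40
```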
All that being said, I do like it, and it seems like a nice product. From what you say I feel like you just haven't found the right target market for it.
Most developers in most IT departments don't even have access to monitoring tools and very little of developer/application importance is monitored beyond server up/down, cpu, and memory usage.
A lot of the most important monitoring I do for my own application is looking at page load times, slow DB queries, custom app metrics, error rates, and looking for specific log statements. Many of these things can't even be done in basic monitoring systems.
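As a rough illustration of the kind of app-level measurements listed above (timings, slow calls, error rates), here is a minimal in-memory sketch. The names `timed`, `slow_calls`, and `run_query` are hypothetical, and a real setup would send these values to a metrics backend rather than keep them in dictionaries.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stores, for illustration only.
durations_ms = defaultdict(list)   # metric name -> observed durations in ms
error_counts = defaultdict(int)    # metric name -> number of raised exceptions


def timed(metric):
    """Record wall-clock duration and error count for the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                error_counts[metric] += 1
                raise
            finally:
                # Runs on both success and failure, so every call is timed.
                elapsed = (time.perf_counter() - start) * 1000.0
                durations_ms[metric].append(elapsed)
        return wrapper
    return decorator


def slow_calls(metric, threshold_ms):
    """Return recorded durations for `metric` that exceed `threshold_ms`."""
    return [d for d in durations_ms[metric] if d > threshold_ms]


@timed("db.query")
def run_query(fail=False):
    # Stand-in for a real database call (hypothetical).
    if fail:
        raise RuntimeError("query failed")
    return "rows"
```

Calling `run_query()` a few times and then `slow_calls("db.query", 200)` gives the slow-query list; `error_counts["db.query"]` divided by the number of recorded durations gives an error rate.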
Bosun's alerting rules look awesome. But most people don't even know what to monitor, let alone figure out how to write javascript expressions to do so.
I tried it for a while and it sure has potential as one of those modern monitoring systems that are replacing nagios right now.
However, in production I would not want to run it in Docker. I would want to set up my own server with the option to scale it to remote pollers.
In my org we ended up choosing another nagios replacement, but not because of any flaw in bosun.
I love iterating over the main points that we look for in a monitoring solution.
Self-hosted. Scalable: remote pollers that can plug in to the central servers. Locations: remote pollers can add locations to monitor from. A collector agent that runs periodically from monitored servers, instead of the nrpe model that listens for connections. The collector OS agent is Windows compatible and backwards compatible with nagios scripts. Monitoring focuses on sending metrics first and foremost, so you can set thresholds for metrics, just like bosun does. And of course, with those metrics the web GUI draws fancy graphs for everything.
And last but not least, all of this (monitoring agent, pollers) uses a standard API like REST or XML-RPC.
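A push-style agent like the one described above could look roughly like this. This is only a sketch: the endpoint URL, the JSON field names, and the `checks` mapping are assumptions for illustration, not the API of any product mentioned in this thread.

```python
import json
import time
import urllib.request


def collect(checks, hostname, now=None):
    """Run each check callable once and return a list of metric samples."""
    ts = int(now if now is not None else time.time())
    return [
        {"host": hostname, "metric": name, "value": fn(), "timestamp": ts}
        for name, fn in checks.items()
    ]


def build_request(url, samples):
    """Wrap the samples in an HTTP POST with a JSON body (nothing is sent yet)."""
    body = json.dumps(samples).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_once(url, checks, hostname):
    """Collect and push one batch; meant to be called from cron or a timer loop."""
    req = build_request(url, collect(checks, hostname))
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

The agent initiates every connection itself, so the monitored server does not need to listen on a port the way the nrpe model does.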
> At Stack Exchange we do not use Docker in production. For those that do not wish to use docker, we provide binaries for bosun at bosun.org, but you will also need to install OpenTSDB and HBase yourself.
Correct, our docker instance is really designed towards letting people play with it quickly without all the trouble of a production setup.
Stack Exchange's production setup isn't documented, but we use Cloudera for HBase, relay all data through the tsdbrelay cmd (found in the bosun repo), and have HAProxy in front of it all.
* Scalable: 1/2 * √. The web interface doesn't have pagination, so series with large tagsets can cause some GUI problems. We have been having some trouble with OpenTSDB lately. People have much larger OpenTSDB installations than we do, though; HBase isn't one of our best skills. But it is scaling okay for Stack Exchange.
* Remote Poller: √. Our agent scollector can run in a polling mode for things that need polling, like SNMP, vSphere, etc.
* Windows Support: √. This is one of the main reasons we built scollector. It is a single binary with no dependencies. We spent a lot of time digging into the WMI raw performance counters to get the best data we could.
* Backward Compatible with Nagios Scripts: Negative. scollector can use external scripts, but they are not in the nagios format.
* Thresholds: √. This is the most basic form of alerting Bosun does. You can also construct forecast, anomalous, and multiple condition alerts as well. The power in what you can do with alerting is really where Bosun shines.
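To illustrate the difference between a plain threshold and a forecast-style alert, here is a small standalone sketch. It is not Bosun's expression language or implementation; it uses an ordinary least-squares line, and the function names and warn/crit values are assumptions.

```python
def threshold_status(value, warn, crit):
    """Classify a single reading against static warn/crit thresholds."""
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "normal"


def seconds_until(points, limit):
    """Estimate seconds from the last sample until the series reaches `limit`.

    `points` is a list of (timestamp_seconds, value) pairs. A least-squares
    line is fitted through them. Returns None when there is too little data
    or the trend is flat or falling (this sketch only handles upper limits).
    """
    n = len(points)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    t_hit = (limit - intercept) / slope
    # Already at or past the limit means "zero seconds left".
    return max(0.0, t_hit - points[-1][0])


# Disk usage in percent, sampled once a minute (made-up numbers).
disk_used = [(0, 50.0), (60, 60.0), (120, 70.0)]
print(threshold_status(disk_used[-1][1], warn=80, crit=90))  # normal
print(seconds_until(disk_used, limit=100.0))  # roughly 180 seconds
```

The threshold check says everything is fine right now, while the forecast says the disk fills up in about three minutes, which is why the two alert types complement each other.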
Not the OP, but we went with Icinga2. Aside from some crappy pre- and post-upgrade scripts, it works remarkably well, and is compatible with the Nagios monitoring plugin ecosystem.
Add in some Skyline (based off the Etsy project), graphite, and collectd, and it makes for a flexible and extensible monitoring solution.
I want to avoid a shameless plug, because the only reason we chose this solution was that we bought the company that makes it, so we essentially own it now.
For anyone else it breaks the self-hosted requirement since it's a Monitoring as a service called monitorscout.com.
God it's time someone came up with a good, modern monitoring system. I used Nagios for years but it never evolved past a bunch of CGI scripts written in C(!). I tried Sensu, and was moderately impressed until a major update broke everything and it never worked again.
I've used Nagios in quite big installations. There are plenty of things that can be improved, and have been in the surrounding ecosystem, but that's just painting a false picture. Nobody runs CGI anymore, and even if they did, they're not in C and never have been. Some of the checks are, of course.
I've had an intern working to set up Bosun and OpenTSDB on an Ubuntu server, from source, for a few weeks now. He's close, but today is his last day.
I'd need to pay someone to professionally set this up for us (so we can easily distribute it with our enterprise software), preferably with just bash scripts. I also need consulting. Like, is it realistic to use a single server for our logging load?
I work for a large multi-national. If you're qualified, and interested, we can engage you to help us out. Contact is in my profile.
It seems to be a DSL to describe alerts over whole clusters. That's probably what a monitoring system for the cloud age should be. It can monitor Logstash and Graphite, which are proven ways to collect data in a disparate environment.
But many in the comments compare it with Nagios, which I think isn't really fair. You could probably easily plug this into Nagios, and its dependency rulework can figure out who to page when. Because that's what Nagios is, not the default checks it ships with.
I just finished work on adding much richer snmp polling ability to scollector (the data collection agent that goes along with bosun).
I don't think it's fully documented yet, but I can help you get started. Join us in our Slack room if you wanna talk about it: http://bosun.org/slackInvite.html
It looks like since you define both the queries and the inputs, you can set it up to support SNMP. You can use Logstash to listen for SNMP traps, and push it to your OpenTSDB or Elasticsearch instance, from where Bosun will pull it.
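For the last hop into OpenTSDB, a datapoint needs a metric name, a timestamp, a value, and at least one tag. The sketch below builds that structure in the JSON shape accepted by OpenTSDB's `/api/put` endpoint, plus the equivalent telnet-style `put` line. The metric and tag names are invented, and the character whitelist is a simplification of OpenTSDB's naming rules.

```python
import json
import re

# Simplified whitelist: letters, digits, and - _ . / (an assumption of this sketch).
_DISALLOWED = re.compile(r"[^a-zA-Z0-9\-_./]")


def clean(name):
    """Replace characters outside the whitelist with underscores."""
    return _DISALLOWED.sub("_", name)


def to_datapoint(metric, value, timestamp, tags):
    """Build one datapoint dict in the shape used by OpenTSDB's /api/put."""
    if not tags:
        raise ValueError("OpenTSDB requires at least one tag per datapoint")
    return {
        "metric": clean(metric),
        "timestamp": int(timestamp),
        "value": value,
        "tags": {clean(k): clean(str(v)) for k, v in tags.items()},
    }


def to_put_line(dp):
    """Render the same datapoint as a telnet-style `put` line."""
    tags = " ".join(f"{k}={v}" for k, v in sorted(dp["tags"].items()))
    return f"put {dp['metric']} {dp['timestamp']} {dp['value']} {tags}"


# A made-up interface counter, as it might arrive from an SNMP poll or trap.
dp = to_datapoint("net.if in octets", 1234, 1425000000,
                  {"host": "sw01", "iface": "eth0"})
print(json.dumps(dp))
print(to_put_line(dp))  # put net.if_in_octets 1425000000 1234 host=sw01 iface=eth0
```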
The first ~13 minutes cover some of the design thoughts, the why, etc. Then I start a demo with some screencasts.