Bosun – open-source monitoring and alerting system by Stack Exchange

KyleBrandt · on July 23, 2015

I did a presentation on Bosun at the most recent Monitorama conference: https://vimeo.com/131581326

The first ~13 minutes cover some of the design thinking (the why, etc.). Then I start a demo with some screencasts.

gbrayut · on July 23, 2015

And if you are a developer who likes working on these sorts of things, you should come work with us!

Details at https://careers.stackoverflow.com/jobs/92395/site-reliabilit...

clebio · on July 23, 2015

Thanks for this. That first 13 minutes is a really good thought process. I'm sharing with my team.

spo81rty · on July 23, 2015

3 years ago I started a company (Stackify) with the hopes of building a better nagios. But it doesn't seem like developers really wanted it. They wanted and needed a lot more as basic server metrics didn't really tell much of a story about application health. The shift to cloud based services also makes a lot of basic monitoring tools unnecessary. Cloud based apps don't really need to monitor servers or infrastructure beyond simple CPU and memory measurement. Developers need to monitor the app itself. Which can really only be done by code profiling, custom metrics, and analyzing errors and log statements.

So we have since pivoted a little bit and have focused heavily on true application monitoring via basic server metrics, custom app metrics, error tracking, log management, and true APM code profiling. All of this together provides a lot of power when it comes to monitoring and finding application problems.

A lot of companies we talk to barely monitor anything about their apps. So many IT teams work in such a reactive mode they aren't very proactive when it comes to monitoring application health and behavior.

Would love to get anyone's feedback about this topic. Do you just use basic server monitoring? In how much detail do you monitor the actual behavior and health of your application? How do you do it?

If you're curious you can check out our product. http://stackify.com

couchand · on July 23, 2015

At Stack Exchange we use Bosun to monitor both system and application metrics. As an application developer I get pretty much everything I need on the system side out of the box with scollector. We spend a fair amount of time on projects now developing application-level metrics, which we use BosunReporter [0] to send to the same instance. It's quite useful to have application and system metrics on the same tool.

[0]: github.com/bretcope/BosunReporter.NET

lmm · on July 23, 2015

It really needs to be minimal-effort to get started. I use newrelic and takipi where I can, and the most crucial thing is that they don't need any ahead-of-time configuration - I can slap them on in a generic configuration and be confident that they'll alert appropriately (newrelic) or collect the right information about failures (takipi).

Tiksi · on July 23, 2015

I don't wanna derail the comments here, since as cbaleanu mentioned it's pretty off topic, so I'll try to stay brief here. If you would like some more input, my email is in my profile.

> But it doesn't seem like developers really wanted it.

I don't see why (most) developers would want a monitoring system. Why not target sysadmins / devops? In my experience developers really just want a trace of their app when it breaks, or maybe some performance profiling.

> The shift to cloud based services also makes a lot of basic monitoring tools unnecessary.

Even though on HN it might seem like everyone is using "the cloud" for everything, the reality is quite different. There's a pretty sizable market for server monitoring, even in the case of cloud services, someone is managing the metal.

> code profiling, custom metrics, and analyzing errors and log statements.

newrelic provides a lot of these, although I'm personally not a fan (it seems like their daemon is the main thing causing resource alerts), and has a free tier.

>So many IT teams ... aren't very proactive when it comes to monitoring application health and behavior.

Well if you're talking to IT teams, they may not be able to do anything proactive regarding health / behavior of applications. In many cases the team making sure the service stays up has no control over the code or inner working of the application. They're limited to configuring and changing supporting servers (database, lb, web server), and not the application itself, and the application state is treated as a binary "is it up or down", with "turn it off and back on again" being a common "fix".

Looking at the product, it does look quite nice, but collectd, icinga, nagios, zabbix (and now bosun, which I'll likely be switching to) are free and open source server monitoring tools, and newrelic with all its in-depth monitoring has a free tier. Assuming that virtual servers (xen) are considered separate servers, I definitely can't see paying $15/mo per server/instance. I'd be paying an order of magnitude more than the colo for my servers costs me. However, with 5 physical servers, $75/mo is a lot more reasonable than the (at minimum) $600 I'd be paying if billed per instance. At work, our 4000+ servers would cost $60000 a month, more than enough that we could hire developers to write a custom system. With so many free and/or open options, it'd really have to blow me away to be worth it. Additionally, a free trial for something like monitoring is a hard sell. The time and effort invested into the setup of a new monitoring system is definitely not free.
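For anyone who wants to plug in their own numbers, the arithmetic above can be sketched in a few lines of Python. The $15 per server per month price and the counts of 5 and 4000 servers come from this comment; the count of 40 instances is only inferred from the "$600 at minimum" figure, and the function name is made up for illustration.

```python
# Back-of-the-envelope monthly cost at a flat per-server price.
PRICE_PER_SERVER_USD = 15  # per server/instance per month, as quoted in the comment


def monthly_cost(server_count, price=PRICE_PER_SERVER_USD):
    """Return the monthly bill in USD for `server_count` billable servers."""
    if server_count < 0:
        raise ValueError("server_count must be non-negative")
    return server_count * price


physical_only = monthly_cost(5)    # billing only the physical hosts
per_instance = monthly_cost(40)    # 40 instances is inferred from $600, not stated
work_fleet = monthly_cost(4000)    # the "4000+ servers" case, lower bound
```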

All that being said, I do like it, and it seems like a nice product. From what you say I feel like you just haven't found the right target market for it.

cbaleanu · on July 23, 2015

Just curious, how is this related to the article on which you're commenting?

spo81rty · on July 23, 2015

Most developers in most IT departments don't even have access to monitoring tools and very little of developer/application importance is monitored beyond server up/down, cpu, and memory usage.

A lot of the most important monitoring I do for my own application is looking at page load times, slow DB queries, custom app metrics, error rates, and looking for specific log statements. Many of these things can't even be done in basic monitoring systems.

Bosun's alerting rules look awesome. But most people don't even know what to monitor, let alone figure out how to write javascript expressions to do so.

KyleBrandt · on July 23, 2015

Bosun uses a custom DSL, not javascript. What you are saying is fair - bosun is currently targeted at an advanced audience.

There is active discussion about making alerts GUI creatable to make it more accessible. Can hopefully be less vague about that in a few weeks.

INTPenis · on July 23, 2015

I tried it for a while and it sure has potential as one of those modern monitoring systems that are replacing nagios right now.

However, in production I would not want to run it in Docker. I would want to set up my own server with the option to scale it out to remote pollers.

In my org we ended up choosing another nagios replacement, but not because of any flaw in bosun.

I love iterating over the main points that we look for in a monitoring solution.

Self-hosted. Scalable, with remote pollers that can plug into the central servers. Locations: remote pollers can add locations to monitor from. A collector agent that runs periodically on the monitored servers, instead of the nrpe model that listens for connections. The collector OS agent is windows compatible and backwards compatible with nagios scripts. Monitoring focuses on sending metrics first and foremost, so you can set thresholds for metrics, just like bosun does. And of course, with those metrics the web gui draws fancy graphs for everything.

And last but not least, all of this, monitoring agent, pollers, they all use a standard API like REST or xmlrpc.

drewnoakes · on July 23, 2015

From the quickstart page:

> At Stack Exchange we do not use Docker in production. For those that do not wish to use docker, we provide binaries for bosun at bosun.org, but you will also need to install OpenTSDB and HBase yourself.

KyleBrandt · on July 23, 2015

Correct, our docker instance is really designed towards letting people play with it quickly without all the trouble of a production setup.

Stack Exchange's production setup isn't documented, but we use Cloudera for HBase, relay all data through the tsdbrelay cmd (found in the bosun repo), have HAProxy in front of it all.

One of our more committed users did document his production setup at https://medvedev.io/blog/posts/2015-06-21-bosun-install-1.ht...

KyleBrandt · on July 23, 2015

As far as Bosun goes:

* Self-Hosted: √. (We didn't want cloud monitoring at Stack Exchange)

* Scalable: 1/2 * √. The web interface doesn't have pagination, so series with large tagsets can cause some GUI problems. We have been having some trouble with OpenTSDB lately. People have much larger OpenTSDB installations than we do, though; HBase isn't one of our best skills. But it is scaling okay for Stack Exchange.

* Remote Poller: √. Our agent Scollector can run in a polling mode for things that need polling like SNMP, VSphere, etc

* Windows Support: √. This is one of the main reasons we built scollector. It is a single binary with no dependencies. We spend a lot of time digging into the WMI Raw performance counters to get the best data we could.

* Backward Compatible with Nagios Scripts: Negative. scollector can use external scripts, but they are not in the nagios format.

* Thresholds: √ This is the most basic form of alerting Bosun does. You can also construct forecast, anomalous, and multiple condition alerts as well. The power in what you can do with alerting is really where Bosun shines.
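To illustrate the difference between a plain threshold alert and a forecast alert, here is a minimal sketch in Python. It is not Bosun's expression language and not how Bosun implements either one: the function names, the sample disk-usage numbers, and the 24-hour horizon are all made up, and the forecast is just an ordinary least-squares line through the recent samples.

```python
from statistics import mean


def threshold_alert(values, limit):
    """Fire when the average of the window exceeds the limit."""
    return mean(values) > limit


def linear_fit(points):
    """Least-squares slope and intercept for (timestamp, value) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def seconds_until(points, target):
    """Seconds from the last sample until the fitted line reaches `target`.

    Returns None when the trend is flat or falling (the target is never
    reached) and 0.0 when the fitted line is already at or past the target.
    """
    slope, intercept = linear_fit(points)
    if slope <= 0:
        return None
    crossing_time = (target - intercept) / slope
    return max(0.0, crossing_time - points[-1][0])


# Hypothetical disk usage in percent, sampled hourly (timestamps in seconds).
disk_used = [(0, 50.0), (3600, 52.0), (7200, 54.0)]

over_limit_now = threshold_alert([v for _, v in disk_used], limit=90)
eta = seconds_until(disk_used, target=90)
fills_within_a_day = eta is not None and eta < 24 * 3600
```

With these samples the threshold check stays quiet because usage is nowhere near 90%, while the forecast check fires because the trend line reaches 90% in about 18 hours.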

newman314 · on July 23, 2015

Which nagios replacement did you choose?

falcolas · on July 23, 2015

Not the OP, but we went with Icinga2. Aside from some crappy pre- and post-upgrade scripts, it works remarkably well, and is compatible with the Nagios monitoring plugin ecosystem.

Add in some Skyline (based off the Etsy project), graphite, and collectd, and it makes for a flexible and extensible monitoring solution.

INTPenis · on July 23, 2015

I want to avoid a shameless plug because the only reason we chose this solution was because we bought the company that makes it so we essentially own it now.

For anyone else it breaks the self-hosted requirement since it's a Monitoring as a service called monitorscout.com.

smegel · on July 23, 2015

God it's time someone came up with a good, modern monitoring system. I used Nagios for years but it never evolved past a bunch of CGI scripts written in C(!). I tried Sensu, and was moderately impressed until a major update broke everything and it never worked again.

xorcist · on July 23, 2015

I've used Nagios in quite big installations. There are plenty of things that can be improved, and have been in the surrounding ecosystem, but that's just painting a false picture. Nobody runs CGI anymore, and even if they did, they're not in C and never have been. Some of the checks are, of course.

smegel · on July 23, 2015

Hmmm https://github.com/NagiosEnterprises/nagioscore/tree/master/...

sciurus · on July 23, 2015

In addition to Bosun, check out http://prometheus.io/

euroclydon · on July 23, 2015

I've had an intern working to set up Bosun and OpenTSDB on an Ubuntu server, from source, for a few weeks now. He's close, but today is his last day.

I'd need to pay someone to professionally set this up for us (so we can easily distribute it with our enterprise software), preferably with just bash scripts. I also need consulting. Like, is it realistic to use a single server for our logging load?

I work for a large multi-national. If you're qualified, and interested, we can engage you to help us out. Contact is in my profile.

KyleBrandt · on July 23, 2015

If you find someone make sure they know about our slack chat room. They can get an invite via http://bosun.org/slackInvite.html

bbrazil · on July 23, 2015

How many servers and metrics are you expecting?

I know for prometheus.io we can handle at least 1M metrics per server.

euroclydon · on July 23, 2015

I had not heard of Prometheus. Thanks. Any reason it would not run on Windows?

bbrazil · on July 23, 2015

We'd like it to run on Windows, however none of the core developers use Windows.

https://github.com/prometheus/prometheus/issues/505

We haven't heard back recently, so it's possible it's all working now.

euroclydon · on July 23, 2015

Cool. If/when we get to it, I'll let you know.

xorcist · on July 23, 2015

It seems to be a DSL to describe alerts over whole clusters. That's probably what a monitoring system for the cloud age should be. It can monitor Logstash and Graphite, which are proven ways to collect data in a disparate environment.

But many in the comments compare it with Nagios, which I think isn't really fair. You could probably easily plug this into Nagios, and its dependency rulework can figure out who to page when. Because that's what Nagios is, not the default checks it ships with.

KyleBrandt · on July 23, 2015

Bosun really is different in many ways from Nagios. There are 4 major components that are at the root of Bosun:

* Time Series Database: Lets you do forecasting and anomalous alerts. Basically lets your alerts have context

* Expression Language so you can manipulate that data: Makes the data you collect and how you alert more orthogonal

* Templating for alerts (Built on Go templates, can include graphs, tables, links, etc)

* IDE-like interface for developing alerts, plus an alert handling workflow. It also lets you test alerts against history, which allows for rapid alert development.

So other than "it does alerting" it is a very different beast from Nagios.
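As a rough illustration of what "testing alerts against history" means, here is a small Python sketch that replays a rule over sliding windows of past data and reports when it would have fired. This is not Bosun's implementation or its expression language; the function names, the sample series, and the window size are all hypothetical.

```python
def backtest(history, window, rule):
    """Replay `rule` over every sliding window of `history`.

    `history` is a list of (timestamp, value) pairs in time order.
    Returns the end timestamp of each window in which the rule fired.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    fired = []
    for end in range(window, len(history) + 1):
        chunk = history[end - window:end]
        if rule([value for _, value in chunk]):
            fired.append(chunk[-1][0])
    return fired


def avg_above(limit):
    """Build a rule that fires when the window average exceeds `limit`."""
    return lambda values: sum(values) / len(values) > limit


# Made-up CPU percentages sampled once a minute.
history = [(0, 10), (60, 20), (120, 90), (180, 95), (240, 30), (300, 15)]
fired_at = backtest(history, window=2, rule=avg_above(80))
```

Here the rule would have fired once, for the window ending at timestamp 180, which is the kind of feedback that lets you tune a limit before the alert ever pages anyone.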

Gideonnn · on July 23, 2015

I browsed through the website, but couldn't find if it had SNMP support. If it has, it can maybe replace Zabbix for our company.

captncraig · on July 23, 2015

developer here.

I just finished work on adding much richer snmp polling ability to scollector (the data collection agent that goes along with bosun).

I don't think it's fully documented yet, but I can help you get started. Join us in our Slack room if you wanna talk about it: http://bosun.org/slackInvite.html

jrgnsd · on July 23, 2015

It looks like since you define both the queries and the inputs, you can set it up to support SNMP. You can use Logstash to listen for SNMP traps, and push it to your OpenTSDB or Elasticsearch instance, from where Bosun will pull it.
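For reference, whatever sits in the middle ultimately has to write data points to OpenTSDB, and OpenTSDB's HTTP API accepts them as JSON on `/api/put`. Here is a minimal Python sketch of building such a request. The metric name, host tag, and server URL are made up for illustration, and this says nothing about how Logstash itself is configured.

```python
import json
import time
import urllib.request


def make_datapoint(metric, value, tags, timestamp=None):
    """Build one data point in the JSON shape OpenTSDB's /api/put expects."""
    if not tags:
        raise ValueError("OpenTSDB requires at least one tag per data point")
    if timestamp is None:
        timestamp = time.time()
    return {
        "metric": metric,
        "timestamp": int(timestamp),
        "value": value,
        "tags": dict(tags),
    }


def build_put_request(base_url, datapoints):
    """Prepare (but do not send) a POST of the data points to /api/put."""
    body = json.dumps(datapoints).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/put",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical metric, tag, and host names; fixed timestamp for reproducibility.
point = make_datapoint("snmp.trap.count", 3, {"host": "switch01"}, timestamp=1437609600)
request = build_put_request("http://opentsdb.example:4242", [point])
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```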

Gideonnn · on July 23, 2015

That sounds pretty good. I'll have a more in depth look later. Thanks!

mjibson · on July 23, 2015

Here are the things it can monitor out of the box: http://bosun.org/scollector/