Why Big Monitoring Software Sucks (obfuscurity.com)
65 points by craigkerstiens on June 20, 2012 | 23 comments



It is easy to beat up on big, complicated monitoring software: call it "enterprise level" and then proceed to find flaws. There is a reason it is complicated: it tries to solve tough problems. Can these other offerings give me:

1) High availability with minimal downtime (seconds)

2) I18N readiness and L10N language packs

3) Scale on the order of 10K or 100K nodes

4) Guaranteed delivery of alerts and metrics

5) Easy deployment and configuration, where the first step is NOT "download Redis, deploy it, and configure it."

Make no mistake, I love open source offerings, but having worked in this domain for a while: enterprise software becomes complicated because it solves complicated problems with minimal intervention by users. Not every IT shop has super-duper DevOps ninjas who crunch machine learning paradigms for breakfast.


Excuse the cynical POV here, but it comes from a number of years of dealing with such things. The problem I find with enterprise software, as a rule, is that it solves the above but fails at:

1) Realistic scaling. Just about every piece of enterprise-grade software I've seen falls off a cliff when it hits a certain load point. That point is ALWAYS just high enough to sell it to you. When it goes wrong...

2) When it goes wrong, it's a nightmare of maintenance contracts, verification, finger pointing and telephone calls that last hours.

3) It rarely works as advertised. I mean literally 1% of it works as advertised.

4) Trials and realistic evaluations without crazy constraints are not usually possible (this is changing slowly).

5) Installation is a breeze but when it comes to backup/restore and upgrade, it's a pain in the butt.

Time is money. The difference between "enterprise level software" and "OSS" platforms is the following:

* OSS time is spent up front.

* Enterprise time is spent later on and costs more up front.

I'd rather have one Ninja on the team and lose the enterprise software. Ninjas scale better as well.


this.

I do a lot of work in Enterprise CMS land and Open Source land, and you've essentially described my experiences working with ALL the Enterprise CMSes I've used since the early '00s. I'm pretty convinced that within about 18 months Open Source CMSes will have negated any residual technical benefits that Enterprise platforms offer.

At that point, the only competitive advantages Enterprise CMS vendors will have are their reputation, the "nobody ever got fired for buying IBM" attitude, and commercial support.

With many other businesses sprouting up to provide commercial support for open source platforms, I can see many Enterprise vendors having to pull up their socks and shake up how they sell and manage their platforms in the very near future.


My favorite aspect of this is that enterprise support is often very bad and unknowledgeable about the product if you try to do something that's the least bit uncommon or outside their script.

You would think people focused on knowing just one product would know something about it.


Not at all - they know shit. Unfortunately, "enterprise software companies" hire the lowest bidder in the lowest-bidding country, who in turn hire the lowest-bidding staff.

This is especially true of Microsoft, whose highest level of Gold Partner support I've had the pleasure of dealing with over an IE9 bug that broke ClickOnce entirely.

Basically: absolutely fucking useless, blame the client, wriggle out of having to do anything.

That was until they met me. 35 phone calls (I shit you not), 3 heated arguments, bad press spread all over Stack Overflow and MS Connect, a blog whinge, and finally a half-arsed registry fix that we had to deploy across 2000 disparate clients!

6 fucking months it took, and we're an MS Gold Partner. It cost us more than our subscription in bad rep, time and support costs.


My experience buying enterprise software (an ecommerce platform):

1. Support? Don't make me laugh. Once you've delivered to them a paper trail 2 miles long proving beyond a shadow of a doubt that your installation is compliant with their recommendations, you might get through to a low-level guy who will give you advice like: "That's a lot of products. Can you reduce the number of products you sell on your site?"

2. Ease of use? Installation of this software was a total nightmare. It basically only operated in 2 modes: 1) the wizard/demo mode, which could not be used for production, and 2) find a consultant with the magic set of ancient Ant build scripts to pass around.

3. Scaling? Theoretically built in, in practice just as difficult as scaling any other framework. It only requires getting the above consultant to talk to another consultant who has the super secret scripts which jump past the ORM and hit the database directly (which, by the way, isn't supported by the product anyway).

4. Most features were either implemented just far enough to convince CTOs there was a check in that feature box, or were so "generic" that all you had to do was "write a class which implements the feature" and plug it in. For this (the privilege of being able to inject code using their home-grown dependency injection system), we paid through the nose.

5. Backup/Restore? Literally a f!@#! unsolved problem. Required a pretty much simultaneous snapshot of the filesystem of every app server as well as all the databases.

After several months I became convinced that the only reason this super expensive platform existed was to provide jobs for consultants. Then I (briefly) became one before joining a startup.


Could not agree more. This is the approach we've strived to take while building https://metrics.librato.com ... API access for everything, integration with popular OSS for metric collection (e.g. statsd, collectd), loosely coupled integration with complementary tools (e.g. Papertrail, PagerDuty), etc.
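For anyone curious what that loose coupling looks like in practice, here is a minimal sketch of emitting metrics over the statsd wire protocol: plain-text datagrams over UDP, port 8125 by default. The address and metric names below are made up for illustration.

    import socket
    import time

    # statsd listens for plain-text datagrams on UDP port 8125 by default.
    # The address and metric names here are illustrative only.
    STATSD_ADDR = ("127.0.0.1", 8125)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send_counter(name, value=1):
        # "<name>:<value>|c" increments a counter
        sock.sendto(("%s:%d|c" % (name, value)).encode(), STATSD_ADDR)

    def send_timer(name, millis):
        # "<name>:<value>|ms" records a timing sample
        sock.sendto(("%s:%d|ms" % (name, millis)).encode(), STATSD_ADDR)

    start = time.time()
    # ... do some work worth measuring ...
    send_counter("app.requests")
    send_timer("app.request_time", int((time.time() - start) * 1000))

Because it's fire-and-forget UDP, instrumenting an app this way costs almost nothing, and the collection/trending side can be swapped out independently.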


@josephruscio - I was >this< close to going back and giving a quick mention to Librato Metrics. You guys definitely "get" what I'm talking about. I like that you provide well-defined interfaces for easily getting data into and out of your application. You focus on trending and let other (better?) software handle the other stuff.


So, what would a "small, sharp" monitoring tool look like that's compelling to customers? Just an agent, and the customer supplies their own notification and trend-reporting services?



I think Pingdom applies to some extent, in particular because it allows you to get the raw data out, which can then be imported into trending software or notification services elsewhere.


To some degree, the best "monitoring software" doesn't even monitor at all; it simply provides a framework for people to add their monitors to. To take one example I'm familiar with (though I'm sure this applies to the rest of the great monitoring systems): Nagios is, at its heart, a state-tracking/notification/scheduling engine. The fact that you can add commands like "ping" or "http test" or "database schema verification" to the tasks you are scheduling and tracking state on is almost incidental to the core of what Nagios does.
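To make that concrete: a Nagios check is just any executable that prints a line of status text and reports its state through an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Here is a minimal sketch of a disk-usage check; the path and thresholds are illustrative, not from the article.

    #!/usr/bin/env python
    # Minimal Nagios-style check: disk usage on a filesystem.
    # The path and thresholds are illustrative; what Nagios actually tracks
    # is the exit code: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
    import os
    import sys

    PATH = "/"           # filesystem to check (illustrative)
    WARN, CRIT = 80, 90  # percent-used thresholds (illustrative)

    try:
        st = os.statvfs(PATH)
        used_pct = 100.0 * (1.0 - float(st.f_bavail) / st.f_blocks)
    except OSError as e:
        print("DISK UNKNOWN - %s" % e)
        sys.exit(3)

    if used_pct >= CRIT:
        print("DISK CRITICAL - %d%% used on %s" % (used_pct, PATH))
        sys.exit(2)
    elif used_pct >= WARN:
        print("DISK WARNING - %d%% used on %s" % (used_pct, PATH))
        sys.exit(1)
    else:
        print("DISK OK - %d%% used on %s" % (used_pct, PATH))
        sys.exit(0)

The engine schedules this, records the state transitions, and decides who to notify; the check itself could just as easily be a ping, an HTTP test, or a schema verification.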


I agree completely, and I think Nagios does a good job at state tracking, notification, and scheduling. The interface is a bit clunky, but with the Nuvola makeover it's completely serviceable. Pair it with something like Ganglia or Cacti for visualization and you're most of the way there.


Take a look at Opsview. It runs on top of Nagios 3, spares you from editing those pesky configuration files, and integrates with MRTG. Pretty neat.


I swear I've seen HP OpenView processes using more CPU than Oracle. I have no idea why companies throw away so much money on such terrible software.


Monitoring systems that have to execute thousands of active checks every polling interval consume a bit of CPU. I've certainly seen high CPU usage and run queue depth on a busy Nagios server.


On a central server, I can understand. These processes were running on clients, though, and consuming as much memory and CPU as they could find.


Sounds like Sensu (https://github.com/sensu/sensu) fits the requirements. It's a small monitoring framework that works in conjunction with Chef or Puppet, PagerDuty, Librato, Graphite, and others.


Sensu came up several times in the recent monitoring discussion on devops-toolchain [0]. I'm going to try a setup with Sensu, Graphite, and collectd soon, like the one Sean Escriva described at ChefConf [1] (a quick sketch of the Graphite side follows below).

[0] https://groups.google.com/group/devops-toolchain/browse_thre... [1] http://www.youtube.com/watch?v=BXxtdE-Paco
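Part of why those pieces compose so easily is that Graphite's carbon listener accepts metrics over a trivially simple plaintext protocol: one "path value timestamp" line per datapoint, TCP port 2003 by default. A minimal sketch, with a made-up metric name and host:

    import socket
    import time

    # carbon-cache accepts the plaintext protocol on TCP port 2003 by default.
    # The host and metric path here are illustrative.
    CARBON_ADDR = ("127.0.0.1", 2003)

    def send_metric(path, value, timestamp=None):
        # One "metric.path value unix_timestamp\n" line per datapoint.
        if timestamp is None:
            timestamp = int(time.time())
        line = "%s %s %d\n" % (path, value, timestamp)
        sock = socket.create_connection(CARBON_ADDR)
        try:
            sock.sendall(line.encode())
        finally:
            sock.close()

    send_metric("servers.web01.load.shortterm", 0.42)

Anything that can open a TCP socket can feed Graphite, which is why Sensu, collectd, statsd, and one-off scripts all plug into the same trending backend.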


Nice article, whose central message ("don't bundle your tools") is widely applicable to domains other than monitoring. Small, specialized tools that can be combined are the very essence of the Unix philosophy.


The problem is particularly acute in monitoring because you tend to need to monitor many disparate systems and subsystems in a very fine-grained manner. Monitoring needs to be reliable and must not impact performance no matter where it runs, which is usually everywhere. The Unix philosophy is definitely useful on a wider basis, but monitoring could be its poster child.


Yeah but does it scale?

Edit: I should clarify: I really liked the article and I like the Unix philosophy, but I'm truly interested to know what type of networks can be monitored with this approach. I work in the enterprise monitoring space and a lot of the solutions are truly horrible, but on the other hand, I'm not sure this approach would work either when you have 500K managed nodes and are dealing with multiple campuses across 3 continents...


Yes, it certainly /can/ scale. In fact, a properly modularized monitoring solution /should/ be more capable of scaling if it was designed with this in mind. This is certainly the approach we've taken with our monitoring and trending components at Heroku. We don't have anywhere near 500K nodes, but the principles scale.



