Monday, August 22, 2011

Second Post on SolarWinds Tricks

This is the second part of a 3-part series I posted over on www.thwack.com about ways to make their premier toolset - SolarWinds Orion Network Performance Monitor (NPM) - jump through hoops. You can find the first post here: http://leonadato.blogspot.com/2011/05/i-posted-this-over-on-thwack.html (it's also on Thwack).



This is the second in a series of posts where, in the name of giving back to the community, I'm going to share some of the customizations that make SolarWinds a little more robust for us and our customers.

First, a little background about my company and how we use SolarWinds. Sentinel is an IT solutions provider that focuses on communications technologies, Data Center, and Outsourced / Managed Solutions.

One of our key services (and the thing that lets me put food on the table) is a remote monitoring solution (based on SolarWinds, of course). All we have to do is drop a VPN router onto the customer's premises and set up NATs for the devices they want us (read: "pay us") to monitor, and we're good to go. This is a perfect fit for our customer base: they don't want to divert resources into the ongoing investment in staff, software, and skills needed to set up an enterprise-wide monitoring and management solution (not to mention figuring out who's going to handle all those pesky tickets).

So our model - where we have many independent customers with different sets of values, different monitoring requirements and so on - has driven us to come up with some customizations that focus on:
  • How to stop alerting on various devices (because of pilot projects, new customer onboarding, or maintenance windows) while continuing to collect statistics
  • How to set thresholds for devices when those thresholds could differ on nearly a device-by-device basis
  • How to ignore alerts based on the built-in monitors for CPU, RAM, etc. on older or closed-architecture devices where a custom OID gives better data
This post is going to look at our solution for the second bullet - how to set thresholds for devices on a device-by-device basis. You can find the discussion about the first item here.
If you've worked with SolarWinds alerts for more than 15 minutes, you probably already know the slippery slope they present. You start by setting an alert for CPU with a pretty logical threshold of "> 90% for 10 minutes". Soon after that, one of two things happens (or both, depending on your environment):
  1. Device "owners" complain about all the events you are missing because the threshold is too high
  2. The people receiving alerts complain they are getting too many false alarms because the threshold is too low
About this time you realize that various devices - depending on their machine type, OS, role, or even the specifics of that particular system - require custom thresholds.

So you start copying alerts and modifying them. And when you turn around, you realize you've got 237 different "high CPU" alerts and the logic of each of them ("machine type = "Windows" and IP_Address contains 1.2.3 and (custom field) IS_IMPORTANT = 1 and....") is enough to constipate Einstein.

In a fit of pique during a monitoring review meeting, you throw your hands up in the air and say "why don't I set up a separate threshold for Every. Flipping. Device?!?!?!"

Assuming you retained employment at your company after that outburst, I want to let you in on a secret:

You can.

The key here, much like the one presented earlier for muting, is a couple of custom fields and a little bit of Alert logic.

The Custom Fields
You can call them anything you want, but they should be numeric. Here at Sentinel, we've got ALERT_CPU, ALERT_RAM and ALERT_VOL. The first two go in the nodes table, the last one (logically enough) goes in the volume table.

The Alert Logic
Now we can alert on individualized thresholds for those elements on a node-by-node basis, leveraging the alert system's "complex conditions" option: "where (field or value) xxx is greater/less/equal to (field or value) yyy".

The alert logic for CPU would look something like this:

Where ANY of the following are true
  Where ALL of the following are true
     ALERT_CPU is empty
     CPU_LOAD is greater than 90
  Where ALL of the following are true
     ALERT_CPU is not empty
     the field CPU_LOAD is greater than the field  ALERT_CPU

This has the effect of setting a default threshold for any device that doesn't have a specific value in the custom alert field (that's the first "Where ALL" section). But if the device DOES have a value there, the second "Where ALL" section compares CPU_LOAD to whatever number is in the field ALERT_CPU.
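
If it helps to see that logic outside of the alert builder, here's a rough Python sketch of the same condition. To be clear, this is just my illustration of the decision, not anything SolarWinds runs: the function name cpu_alert_fires and the constant DEFAULT_CPU_THRESHOLD are made up for the example, and None stands in for an empty ALERT_CPU field.

```python
from typing import Optional

# Fallback threshold from the first "Where ALL" group (hypothetical name).
DEFAULT_CPU_THRESHOLD = 90


def cpu_alert_fires(cpu_load: float, alert_cpu: Optional[float]) -> bool:
    """Mirror the ANY-of-two-ALL-groups condition (illustrative only)."""
    # Group 1: ALERT_CPU is empty, so compare against the default.
    uses_default = alert_cpu is None and cpu_load > DEFAULT_CPU_THRESHOLD
    # Group 2: ALERT_CPU has a value, so compare CPU_LOAD against it.
    uses_custom = alert_cpu is not None and cpu_load > alert_cpu
    # "Where ANY of the following are true"
    return uses_default or uses_custom


print(cpu_alert_fires(95, None))  # True: no custom value, and 95 > 90
print(cpu_alert_fires(95, 98))    # False: custom threshold of 98 not exceeded
print(cpu_alert_fires(85, 80))    # True: custom threshold of 80 exceeded
```

Note that the two groups can never both be true for the same node, since ALERT_CPU is either empty or it isn't. That's what lets a single alert cover both the default case and the per-device case.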

For those who are following along from my previous article, here's the logic that includes the "mute" options:

Where ANY of the following are true
  Where ALL of the following are true
     N_MUTE is not equal to YES
     ALERT_CPU is empty
     CPU_LOAD is greater than 90
  Where ALL of the following are true
     N_MUTE is not equal to YES
     ALERT_CPU is not empty
     the field CPU_LOAD is greater than the field  ALERT_CPU

This is also useful if you want to MUTE just one element - say CPU. You have a device that simply "runs hot". You don't want CPU alerts, but you also don't want to mute the whole node, because you still want RAM alerts, interface alerts, etc. Set ALERT_CPU to 105 and you will continue to collect CPU stats, but since the CPU can never go above 100, you won't ever get a CPU alert.
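
Here's the same kind of rough Python sketch for the muted version of the alert, including the "set it to 105" trick. Again, the names (cpu_alert_fires_with_mute, DEFAULT_CPU_THRESHOLD) are mine and purely illustrative, None stands in for an empty custom field, and I'm assuming CPU_LOAD is a percentage that tops out at 100.

```python
from typing import Optional

# Fallback threshold used when ALERT_CPU is empty (hypothetical name).
DEFAULT_CPU_THRESHOLD = 90


def cpu_alert_fires_with_mute(
    cpu_load: float,
    alert_cpu: Optional[float],
    n_mute: Optional[str],
) -> bool:
    """Mirror the muted CPU alert condition (illustrative only)."""
    # Both "Where ALL" groups require N_MUTE to be something other than YES,
    # so a muted node can never satisfy either group.
    if n_mute == "YES":
        return False
    # Empty ALERT_CPU falls back to the default; otherwise use the custom value.
    threshold = DEFAULT_CPU_THRESHOLD if alert_cpu is None else alert_cpu
    return cpu_load > threshold


# A node that "runs hot": ALERT_CPU = 105 can never be exceeded.
print(cpu_alert_fires_with_mute(100, 105, "NO"))   # False
# A fully muted node stays quiet even at 95% with the default threshold.
print(cpu_alert_fires_with_mute(95, None, "YES"))  # False
# An unmuted node with no custom value still uses the default of 90.
print(cpu_alert_fires_with_mute(95, None, "NO"))   # True
```

The point of the first example is that the node is not muted at all. Only its CPU threshold is out of reach, so RAM and interface alerts for that node carry on as usual.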

IN THE NEXT (AND FINAL) POST: How to ignore built-in alerts for CPU, RAM, etc. in favor of custom OIDs.

2 comments:

mozrat said...

Very nice.

I'd also add that you can send email alerts to a Node Custom Property ${Node.Email_Alert_Group} to further customise the alerts

Leon said...

Mozrat you are absolutely right. That's probably another whole post - places you can use SolarWinds variables that you wouldn't expect.

Thanks!