Tuesday, August 30, 2011

Third (and final) Post on SolarWinds tricks

(Originally posted on www.thwack.com here)


This is the third in a series of posts where, in the name of giving back to the community, I'm going to share some of the customizations that make SolarWinds a little more robust for us and our customers.

First, a little background about my company and how we use SolarWinds. Sentinel is an IT solutions provider that focuses on communications technologies, Data Center, and Outsourced / Managed Solutions.

One of our key services (and the thing that lets me put food on the table) is a remote monitoring solution (based on SolarWinds, of course). All we have to do is drop a VPN router onto the customer's premises and set up NATs for the devices they want us to monitor (read "pay us"), and we're good to go. This is a perfect fit for our customer base, who don't want to divert resources into the ongoing investment in staff, software, and skills needed to set up an enterprise-wide monitoring and management solution (not to mention figuring out who's going to handle all those pesky tickets).

So our model - where we have many independent customers with different sets of values, different monitoring requirements, and so on - has driven us to come up with some customizations that focus on:
  • How to stop alerting on various devices (because of pilot projects, new customer onboarding, or maintenance windows) while continuing to collect statistics
  • How to set thresholds for devices when that could be different on nearly a device-by-device basis
  • How to ignore alerts based on the built-in monitors for CPU/RAM, etc. on older or closed-architecture devices where a custom OID gives better data
This post is going to look at our solution for the third bullet - how to ignore built-in SolarWinds values in favor of custom OIDs. You can find the discussion about the first item here and the second item's information here.

If you've been playing along at home, you now have custom fields and alert logic to mute nodes, interfaces, volumes, and maybe even specialized items like APM; you have fields (and associated alert logic) to allow custom alert thresholds for CPU, RAM, disk space, bandwidth, and whatever else makes your heart beat faster.

But then you run into a situation where the built-in SolarWinds pollers don't work correctly for a particular device. Of course you can set up a custom Universal Device Poller (UnDP), but that doesn't stop the default poller from spewing false alarms.

We have that situation with a series of old Cisco 6500s where the standard SW poller mis-reports CPU, and on some Linux-based appliances where the vendor has locked out the standard Linux OIDs in favor of their own. Because Orion detects the machine type as "net-snmp", it still attempts to pull CPU, RAM, etc. using the standard OIDs.

The problem (with regard to the ALERT_CPU, ALERT_RAM, etc. custom fields described in part 2 of this series) is that they are all compared against the standard elements, such as CPU_LOAD.

Of course, you COULD set the ALERT_CPU to some ridiculously high number, and then implement a custom alert. We did, but ran into two problems:
  1. It became difficult to figure out why an alert triggered. We'd see a CPU alert and then notice that the threshold was set to 105%, and things got really confusing until we realized the device in question used a custom CPU OID.
  2. Remember those Linux-based appliances I mentioned earlier? On some of them the standard CPU OID reports 200% or more. Which always makes for jolly good times in the Ops center when they see THAT gauge on the screen.
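
To make the second problem concrete, here is a minimal Python sketch of the part 2 threshold comparison. It is not SolarWinds code; the function name and the sample readings are made up for illustration. It shows why a "ridiculously high" threshold like 105 stops helping once the standard OID reports more than 100%:

```python
def cpu_alert_fires(cpu_load, alert_cpu=None, default_threshold=90):
    """Hypothetical stand-in for the part 2 check:
    use ALERT_CPU when it is set, otherwise fall back to 90."""
    threshold = alert_cpu if alert_cpu is not None else default_threshold
    return cpu_load > threshold


# Workaround: ALERT_CPU cranked up to 105 to silence the standard alert.
print(cpu_alert_fires(cpu_load=97, alert_cpu=105))   # False: suppressed as intended
print(cpu_alert_fires(cpu_load=200, alert_cpu=105))  # True: a bogus 200% reading still alerts
```
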
So we've implemented OVR_STD_CPU and OVR_STD_RAM fields (both simple Yes/No custom properties) to get around this. Effectively, they tell SolarWinds that a non-standard OID is being used as the key element, and the standard OID should be skipped. In its simplest form, the standard CPU alert just gains one extra condition:

Where ALL of the following are true
  OVR_STD_CPU is not equal to YES
  CPU_LOAD is greater than 90
The complete alert logic (including muting and standard ALERT_CPU) would now look like this:

Where ANY of the following are true
  Where ALL of the following are true
     N_MUTE is not equal to YES
     OVR_STD_CPU is not equal to YES
     ALERT_CPU is empty
     CPU_LOAD is greater than 90
  Where ALL of the following are true
     N_MUTE is not equal to YES
     ALERT_CPU is not empty
     OVR_STD_CPU is not equal to YES
     the field CPU_LOAD is greater than the field  ALERT_CPU
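
For anyone who finds nested ANY/ALL conditions easier to read as code, the logic above can be sketched in Python roughly like this. It is only an illustration, not something SolarWinds runs: the `Node` class and `standard_cpu_alert` function are hypothetical, and the Yes/No custom properties are modeled as booleans.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    cpu_load: float                    # CPU_LOAD from the built-in poller
    n_mute: bool = False               # True when N_MUTE is YES (part 1)
    ovr_std_cpu: bool = False          # True when OVR_STD_CPU is YES
    alert_cpu: Optional[float] = None  # ALERT_CPU (part 2); None means empty


def standard_cpu_alert(node: Node) -> bool:
    """Mirror the two ALL groups joined by ANY."""
    default_branch = (
        not node.n_mute
        and not node.ovr_std_cpu
        and node.alert_cpu is None
        and node.cpu_load > 90
    )
    custom_threshold_branch = (
        not node.n_mute
        and node.alert_cpu is not None
        and not node.ovr_std_cpu
        and node.cpu_load > node.alert_cpu
    )
    return default_branch or custom_threshold_branch


print(standard_cpu_alert(Node(cpu_load=95)))                     # True
print(standard_cpu_alert(Node(cpu_load=95, alert_cpu=98)))       # False
print(standard_cpu_alert(Node(cpu_load=200, ovr_std_cpu=True)))  # False
```

The last call is the interesting one: even a 200% reading stays quiet once the override flag is set.
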
This ensures that the standard CPU alert will NEVER trigger for the node in question. Then we can set up a different alert that uses the custom OID and reuses the existing MUTE and ALERT_xxx logic. Of course, it will only trigger when the custom OID has been applied to a node.

Where ANY of the following are true
  Where ALL of the following are true
     N_MUTE is not equal to YES
     OVR_STD_CPU is equal to YES
     ALERT_CPU is empty
     [custom CPU poller] is greater than 90
  Where ALL of the following are true
     N_MUTE is not equal to YES
     ALERT_CPU is not empty
     OVR_STD_CPU is equal to YES
     the field [custom CPU poller] is greater than the field ALERT_CPU
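
The custom-OID alert can be sketched in Python the same way. Treat this as a rough illustration under stated assumptions rather than the real alert definition: `custom_cpu_alert` and its parameters are hypothetical names, `custom_cpu` stands for whatever value your UnDP collects, and the sketch assumes this alert should only apply to nodes where OVR_STD_CPU is YES (the same nodes the standard alert skips).

```python
from typing import Optional


def custom_cpu_alert(
    custom_cpu: float,                 # value from the custom OID (UnDP)
    n_mute: bool,                      # True when N_MUTE is YES
    ovr_std_cpu: bool,                 # True when OVR_STD_CPU is YES
    alert_cpu: Optional[float] = None  # ALERT_CPU; None means empty
) -> bool:
    """Assumed behavior: only nodes flagged with the override use this alert."""
    if n_mute or not ovr_std_cpu:
        return False
    threshold = 90 if alert_cpu is None else alert_cpu
    return custom_cpu > threshold


print(custom_cpu_alert(95, n_mute=False, ovr_std_cpu=True))                # True
print(custom_cpu_alert(95, n_mute=False, ovr_std_cpu=True, alert_cpu=98))  # False
print(custom_cpu_alert(95, n_mute=True, ovr_std_cpu=True))                 # False
```
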
