This is the third in a series of posts where, in the name of giving back to the community, I'm going to share some of the customizations that make SolarWinds a little more robust for us and our customers.
First, a little background about my company and how we use SolarWinds. Sentinel is an IT solutions provider that focuses on communications technologies, Data Center, and Outsourced / Managed Solutions.
One of our key services (and the thing that lets me put food on the table) is a remote monitoring solution (based on SolarWinds, of course). All we have to do is drop a VPN router onto the customer's premises and set up NATs for the devices they want us (read "pay us") to monitor, and we're good to go. This is a perfect fit for our customer base, who don't want to divert resources to the ongoing investment in staff, software, and skills needed to set up an enterprise-wide monitoring and management solution (not to mention figuring out who's going to handle all those pesky tickets).
So our model, where we have many independent customers with different sets of values, different monitoring requirements, and so on, has driven us to come up with some customizations that focus on:
- How to stop alerting on various devices (because of pilot projects, new customer onboarding, or maintenance windows) while continuing to collect statistics
- How to set thresholds for devices when those thresholds could differ on a nearly device-by-device basis
- How to ignore alerts based on the built-in monitors for CPU, RAM, etc. on older or closed-architecture devices where a custom OID gives better data
If you've been playing along at home, you now have custom fields and alert logic to mute nodes, interfaces, volumes, and maybe even specialized items like APM; you have fields (and associated alert logic) to allow custom alert thresholds for CPU, RAM, disk space, bandwidth, and whatever else makes your heart beat faster.
But then you run into a situation where the built-in SolarWinds pollers don't work correctly for a particular device. Of course you can set up a custom Universal Device Poller (UnDP), but that doesn't stop the default poller from spewing false alarms.
We have that situation with a series of old Cisco 6500s where the standard SW poller misreports CPU, and on some Linux-based appliances where the vendor has locked out the standard Linux OIDs in favor of their own. Because Orion detects the machine type as "net-snmp", it still attempts to pull CPU, RAM, etc. using the standard OIDs.
The problem (with regard to the ALERT_CPU, ALERT_RAM, etc. custom fields described in part 2 of this series) is that they all compare against the standard CPU_LOAD element.
Of course, you COULD set ALERT_CPU to some ridiculously high number, and then implement a custom alert. We did, but ran into two problems:
- It became difficult to figure out why an alert triggered. We'd see a CPU alert and then notice that the threshold was set to 105%, and things got really confusing until we realized the device in question used a custom CPU OID
- Remember those Linux-based appliances I mentioned earlier? On some of them the standard CPU OID reports 200% or more. Which always makes for jolly good times in the Ops center when they see THAT gauge on the screen.
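The fix is one more custom field, OVR_STD_CPU, which you set to YES on any node where the standard CPU poller should be ignored. The standard CPU alert then checks that field before it fires:
Where ALL of the following are true
- OVR_STD_CPU is not equal to YES
- CPU_LOAD is greater than 90
The complete alert logic (including muting and standard ALERT_CPU) would now look like this: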
Where ANY of the following are true
- Where ALL of the following are true
  - N_MUTE is not equal to YES
  - OVR_STD_CPU is not equal to YES
  - ALERT_CPU is empty
  - CPU_LOAD is greater than 90
- Where ALL of the following are true
  - N_MUTE is not equal to YES
  - ALERT_CPU is not empty
  - OVR_STD_CPU is not equal to YES
  - the field CPU_LOAD is greater than the field ALERT_CPU
This would ensure that the standard CPU alert would NEVER trigger for the node in question. Then we can set up a different alert that uses the custom OID, which uses the existing MUTE and ALERT_xxx logic. Of course it will only trigger when the custom OID was applied to a node.
Where ANY of the following are true
- Where ALL of the following are true
  - N_MUTE is not equal to YES
  - OVR_STD_CPU is not equal to YES
  - ALERT_CPU is empty
  - [custom OID value] is greater than 90
- Where ALL of the following are true
  - N_MUTE is not equal to YES
  - ALERT_CPU is not empty
  - OVR_STD_CPU is not equal to YES
  - the field [custom OID value] is greater than the field ALERT_CPU
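If it helps to see the nested ANY/ALL conditions as plain logic, here is a rough Python sketch of how one of these triggers evaluates. This is not SolarWinds code, just an illustration: the function name `cpu_alert_fires`, the `props` dictionary of custom fields, and the idea of passing the measured value in as `load` are all assumptions made for the example. Both triggers above have the same shape, so the sketch takes either CPU_LOAD or the custom OID value as `load`.

```python
from typing import Mapping, Optional

# Fallback threshold used when ALERT_CPU is empty (the "90" in the trigger).
DEFAULT_CPU_THRESHOLD = 90


def cpu_alert_fires(props: Mapping[str, Optional[str]], load: float) -> bool:
    """Evaluate the nested ANY/ALL trigger for one node.

    props: hypothetical dict of the node's custom fields
           (N_MUTE, OVR_STD_CPU, ALERT_CPU), values as strings.
    load:  the value being compared, e.g. CPU_LOAD or a custom OID value.
    """
    # Both ALL-branches require these two fields to be "not equal to YES".
    if props.get("N_MUTE") == "YES":
        return False
    if props.get("OVR_STD_CPU") == "YES":
        return False

    threshold = props.get("ALERT_CPU")  # None or "" counts as "empty"
    if not threshold:
        # First branch: no per-node threshold, compare against the default.
        return load > DEFAULT_CPU_THRESHOLD
    # Second branch: compare against the per-node threshold
    # (assumes ALERT_CPU holds a numeric string).
    return load > float(threshold)


# Example: a node flagged with OVR_STD_CPU never fires, whatever the reading.
flagged = cpu_alert_fires({"OVR_STD_CPU": "YES"}, 200)
# Example: a node with its own threshold of 75 and a reading of 80.
custom_threshold = cpu_alert_fires({"ALERT_CPU": "75"}, 80)
```

Checking N_MUTE and OVR_STD_CPU up front works because both ALL-branches include those two conditions; the only thing that differs between the branches is whether ALERT_CPU is empty.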