Quis Custodiet Ipsos Custodes? Monitoring your monitoring system

The group I’m in here at a major telecommunications provider has a nice little setup of HP’s OpenView and Remedy’s Action Request System.

OpenView listens for and seeks out problems with various internal
and customer-owned systems. When something serious occurs, it uses some
third-party integration software (RemedySPI) to create a trouble ticket
in Remedy. Rules fire to assign the ticket to an appropriate “triage”
person. Notification is then sent to the assignee by e-mail or text
pager, depending on the severity of the problem. There is a POTS modem
plugged into the Remedy box that is used to send text pages, and the
Remedy box is allowed to make outgoing SMTP connections to the Internet
to deliver e-mail.

Remedy and OpenView are linked so that closing the ticket removes
the corresponding event in OpenView, and the OpenView event is
annotated with the Remedy ticket number.

Both use an Oracle database to store information about tickets and events.

Each system (Remedy, OpenView, Oracle) runs on a separate piece of Sun hardware.

So we have three single points of failure that could cause notification
of critical events to stop: OpenView (the software, or the hardware
it’s running on), Remedy (likewise) or Oracle (likewise).

Additionally, failure of a single 100BASE-T switch or switch port would sever connectivity and take out the notification system.

I’d like to set up something that would detect a failure of one or more of the three critical systems, and notify someone.

Obviously this backup notification system has to be independent of
the three in question. So it should run on some fourth piece of hardware
which is plugged into a different switch. It would then poll or look
for heartbeats from the three systems, and use its own resources to notify
someone of a problem.
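
To make the polling idea concrete, here is a minimal Python sketch of what the fourth box might run from cron. It only answers "can I open a TCP connection to this service?", which is an assumption about what counts as "alive". The host names, the ports chosen for OpenView and Remedy, and the notify() stub are placeholders invented for illustration; the real notification path (for example, driving a local modem) is not shown.

```python
import socket

# Hypothetical hosts and ports; substitute the real ones for your site.
TARGETS = {
    "openview": ("openview.example.net", 22),  # placeholder port
    "remedy": ("remedy.example.net", 22),      # placeholder port
    "oracle": ("oracle.example.net", 1521),    # usual Oracle listener port
}


def check_tcp(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def failed_targets(targets, probe=check_tcp):
    """Return the sorted names of the targets whose probe failed."""
    return sorted(
        name for name, (host, port) in targets.items() if not probe(host, port)
    )


def notify(names):
    """Placeholder: a real version would page someone over an independent path."""
    print("ALERT: no response from " + ", ".join(names))
```

A cron job would call failed_targets(TARGETS) and pass any non-empty result to notify(). A successful TCP connect only shows that something is listening; it says nothing about whether tickets are actually flowing end to end.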

As far as I can see it would be OK if this backup monitoring box
used the same Internet connection, since e-mail is only used to deliver
lower-severity notifications, and a loss of Internet connectivity would
be a Critical-level event, which would use the POTS line to deliver the
notification.

The main monitoring system would then be set up to monitor the backup system, to make sure it’s running.
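
One simple way to give the main system something to watch is a heartbeat file: the backup box updates a file after every polling cycle, and anything that can see the file's age can tell whether the backup monitor is still running. The sketch below assumes that approach; the function names are hypothetical, and how OpenView would actually collect the result (an agent, SNMP, a remote script) is left open.

```python
import os
import time


def touch_heartbeat(path):
    """Record that the backup monitor completed a polling cycle."""
    with open(path, "a"):
        os.utime(path, None)  # set access and modification times to now


def heartbeat_is_stale(path, max_age_seconds, now=None):
    """Return True if the heartbeat file is missing or older than max_age_seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # a missing or unreadable file counts as a failure
    return (now - mtime) > max_age_seconds
```

With a five-minute cron interval, a max_age_seconds of around 900 would tolerate one missed run before raising an event.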

So… finally, to the question:

What software would you use to set up this monitoring/notification
system? Obviously one could install a completely parallel set of
[OpenView, Oracle, Remedy], but that would be overkill, as we only need
to monitor three machines and a few daemon processes.

Is there some nice Open Source project out there that would allow me
to quickly solve this problem? Has anyone done something like this? Any
comments on my failure-mode analysis? Am I worrying about the right
things?

Comments

Three comments:

1. I’d make sure that the secondary monitoring system is monitored
by your fancy-dancy system, in case it fails (of course, if both fail
at once, you’re hosed, unless you want to add a third system, and so
on).

2. I don’t know how complex a system you need to do the secondary
monitoring. If all you want to answer is ‘is the software running?’ you
may be able to get by with a simple Perl script running from cron (can
I make a connection to Oracle, etc.). A quick look at SourceForge and Google
didn’t turn up anything obviously relevant to more complex cases.

3. How does Remedy send the text messages over the modem? You need
to consider that in your solution, for sure. Not sure how to do that in
Perl.

Posted by: Dan Moore at June 21, 2004 03:33 PM

We use the open source solution: Nagios (http://www.nagios.org/).

It has a simple plugin infrastructure so you can write a bit of Perl code to monitor anything.

It is simple, clean, and very easy to use.

Dion

Posted by: Dion Almaer at June 23, 2004 10:32 AM
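
As a rough illustration of the plugin approach mentioned above: a Nagios plugin is just a program that prints one line of status text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The comment suggests Perl; this sketch uses Python instead, and the function names and the TCP-connect check are assumptions made for the example rather than anything shipped with Nagios.

```python
import socket

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def tcp_probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def plugin_result(host, port, probe=tcp_probe):
    """Return (exit_code, status_line) for a single TCP service check."""
    if probe(host, port):
        return OK, "TCP OK - %s:%d is accepting connections" % (host, port)
    return CRITICAL, "TCP CRITICAL - cannot connect to %s:%d" % (host, port)


def main(argv):
    """Parse 'plugin HOST PORT', print the status line, return the exit code."""
    if len(argv) != 3 or not argv[2].isdigit():
        print("TCP UNKNOWN - usage: %s HOST PORT" % argv[0])
        return UNKNOWN
    code, line = plugin_result(argv[1], int(argv[2]))
    print(line)
    return code
```

To run it as a real plugin, finish the script with sys.exit(main(sys.argv)) (after importing sys) so that Nagios sees the exit code.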
