The group I’m in here at a major telecommunications provider has a nice little setup of HP’s OpenView and Remedy’s Action Request System.
OpenView listens for and seeks out problems with various internal
and customer-owned systems. When something serious occurs, it uses some
third-party integration software (RemedySPI) to create a trouble ticket
in Remedy. Rules fire to assign the ticket to an appropriate “triage”
person. Notification is then sent to the assignee by e-mail or text
pager, depending on the severity of the problem. There is a POTS modem
plugged into the Remedy box that is used to send text pages, and the
Remedy box is allowed to make outgoing SMTP connections to the Internet
to deliver e-mail.
Remedy and OpenView are linked so that closing the ticket removes
the corresponding event in OpenView, and the OpenView event is
annotated with the Remedy ticket number.
Both use an Oracle database to store information about tickets and events.
Each system (Remedy, OpenView, Oracle) runs on a separate piece of Sun hardware.
So we have three single points of failure that could cause notification
of critical events to stop: OpenView (the software, or the hardware
it’s running on), Remedy (likewise), or Oracle (likewise).
Additionally, failure of a single 100baseT switch or switch port would sever connectivity and take out the notification system.
I’d like to set up something that would detect a failure of one or more of the three critical systems, and notify someone.
Obviously this backup notification system has to be independent of
the three in question. So it should run on some 4th piece of hardware
which is plugged into a different switch. It would then poll or look
for heartbeats from the 3 systems, and use its own resources to notify
someone of a problem.
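To make the polling idea concrete, here is a rough sketch of what I have in mind for the 4th box. It only checks that each system accepts a TCP connection, which is an assumption on my part about what counts as "alive"; a real check would probably need to be application-specific. The host names and the OpenView and Remedy ports are hypothetical placeholders, 1521 is just the default Oracle listener port, and notify stands in for whatever would actually dial the pager.

```python
import socket

# Hypothetical hosts and ports for illustration only; the real names
# and listener ports at any given site will differ.
TARGETS = {
    "openview": ("openview.example.com", 7),
    "remedy": ("remedy.example.com", 7),
    "oracle": ("oracle.example.com", 1521),  # default Oracle listener port
}


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failed_systems(targets, probe=is_reachable):
    """Return the sorted names of the systems whose probe failed."""
    return sorted(
        name
        for name, (host, port) in targets.items()
        if not probe(host, port)
    )


def poll_once(targets, notify, probe=is_reachable):
    """Probe every target once; call notify(names) if any are down."""
    down = failed_systems(targets, probe)
    if down:
        notify(down)
    return down
```

Something like cron could run poll_once every minute or so. The probe is passed in as a parameter so that a deeper check (say, a test query against Oracle) could replace the plain TCP connect later without touching the rest.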
As far as I can see it would be OK if this backup monitoring box
used the same Internet connection, since e-mail is only used to deliver
lower-severity notifications, and a loss of Internet connectivity would
be a Critical-level event, which would use the POTS line to deliver the notification.
The main monitoring system would then be set up to monitor the backup system, to make sure it’s running.
So… finally, to the question:
What software would you use to set up this monitoring/notification
system? Obviously one could install a completely parallel set of
[OpenView, Oracle, Remedy], but that would be overkill, as we only need
to monitor 3 machines and a few daemon processes.
Is there some nice Open Source project out there that would allow me
to quickly solve this problem? Has anyone done something like this? Any
comments on my failure-mode analysis? Am I worrying about the right