Archive for June, 2004

Quis Custodiet Ipsos Custodes? Monitoring your monitoring system

Monday, June 21st, 2004

The group I’m in here at a major telecommunications provider has a nice little setup of HP’s OpenView and Remedy’s Action Request System.

OpenView listens for and seeks out problems with various internal
and customer-owned systems. When something serious occurs, it uses some
3rd party integration software (RemedySPI) to create a trouble ticket
in Remedy. Rules fire to assign the ticket to an appropriate “triage”
person. Notification is then sent to the assignee by email or text
pager, depending on the severity of the problem. There is a POTS modem
plugged into the Remedy box that is used to send text pages, and the
Remedy box is allowed to make outgoing SMTP connections to the Internet
to deliver e-mail.

Remedy and OpenView are linked so that closing the ticket removes
the corresponding event in OpenView, and the OpenView event is
annotated with the Remedy ticket number.

Both use an Oracle database to store information about tickets and events.

Each system (Remedy, OpenView, Oracle) runs on a separate piece of Sun hardware.

So we have three single points of failure that could cause notification
of critical events to stop: OpenView (the software, or the hardware
it’s running on), Remedy (likewise) or Oracle (likewise).

Additionally, failure of a single 100baseT switch or switch port would sever connectivity and take out the notification system.

I’d like to set up something that would detect a failure of one or more of the three critical systems, and notify someone.

Obviously this backup notification system has to be independent of
the three in question. So it should run on some 4th piece of hardware
which is plugged into a different switch. It would then poll or look
for heartbeats from the 3 systems, and use its own resources to notify
someone of a problem.
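To make the polling idea concrete, here is a minimal sketch in Python of what the 4th box might run from cron. It only answers “can I open a TCP connection to the service?”. The host names and ports are made up for illustration (1521 is Oracle’s default listener port, but your install may differ), and notify() is a placeholder for whatever the backup box would really use to page someone.

```python
import socket

# Hypothetical hosts and ports; substitute the real ones for your site.
TARGETS = {
    "openview": ("openview.example.com", 80),
    "remedy": ("remedy.example.com", 80),
    "oracle": ("oracle.example.com", 1521),
}


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def failed_targets(targets, check=is_reachable):
    """Return the sorted names of the targets whose check fails."""
    return sorted(
        name for name, (host, port) in targets.items() if not check(host, port)
    )


def notify(names):
    """Placeholder: a real version would use the box's own modem or mail."""
    print("ALERT: no response from " + ", ".join(names))


def poll_once(targets=TARGETS):
    """Check every target once and raise a single alert if any are down."""
    down = failed_targets(targets)
    if down:
        notify(down)
    return down
```

A successful TCP connect only proves that something is listening. It does not prove that Remedy is actually creating tickets, so a deeper check (a test login, a test query) would be the obvious next step.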

As far as I can see it would be OK if this backup monitoring box
used the same Internet connection, since e-mail is only used to deliver
lower-severity notifications, and a loss of Internet connectivity would
be a Critical-level event, which would use the POTS line to deliver the
notification.

The main monitoring system would then be set up to monitor the backup system, to make sure it’s running.

So… finally, to the question:

What software would you use to set up this monitoring/notification
system? Obviously one could install a completely parallel set of
[OpenView, Oracle, Remedy], but that would be overkill, as we only need
to monitor 3 machines and a few daemon processes.

Is there some nice Open Source project out there that would allow me
to quickly solve this problem? Has anyone done something like this? Any
comments on my failure-mode analysis? Am I worrying about the right
things?

Comments

Three comments:

1. I’d make sure that the secondary monitoring system is monitored
by your fancy-dancy system, in case it fails (of course, if both fail
at once, you’re hosed, unless you want to add a third system, and so
on).

2. I don’t know how complex a system you need to do the secondary
monitoring. If all you want to answer is ‘is the software running?’ you
may be able to get by with a simple Perl script running from cron (can
I make a connection to Oracle, etc.). A quick look at SF and Google
didn’t turn up anything obviously relevant to more complex cases.

3. How does Remedy send the text messages over the modem? You need
to consider that in your solution, for sure. Not sure how to do that in
perl.

Posted by: Dan Moore at June 21, 2004 03:33 PM

We use the open source solution: Nagios (http://www.nagios.org/).

It has a simple plugin infrastructure so you can write a bit of Perl code to monitor anything.

It is simple, clean, and very easy to use.

Dion

Posted by: Dion Almaer at June 23, 2004 10:32 AM

Unix command line utility program conventions

Monday, June 7th, 2004

Sometimes a vendor supplies a command-line utility for performing some function that we want to use from within our scripts and programs.

There are some unwritten (at least as far as I can find) rules about how to write one of these utilities so it can be used properly.

Some vendors get this right. Others, not so much…

The Rules

Return an exit status indicating success or failure: zero for success, non-zero for failure. For bonus points, return different non-zero codes depending on what went wrong. (The Anna Karenina Principle)

And you know what? That’s about it. The rest (arguments, input/output locations, etc., etc.) really depends on the context and function of the item in question. Though the following are useful:

  • Provide a useful usage message if invalid arguments are passed.
  • Provide an explicit way (e.g. --help option) to ask for the above usage message.
  • Use GNU-style arguments.

But these are really for human consumption, not for use in a script.
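As a rough illustration of the rule and the nice-to-haves above, here is a minimal Python sketch of a well-behaved utility. The “countlines” name and the specific exit codes 3 and 4 are invented for this example. The argparse behaviour is standard: it prints a usage message and exits with status 2 on invalid arguments, and exits with status 0 after --help.

```python
import argparse
import sys

# Hypothetical exit codes for a hypothetical "countlines" utility.
EXIT_OK = 0
EXIT_NOT_FOUND = 3
EXIT_UNREADABLE = 4


def build_parser():
    """GNU-style options; argparse supplies -h/--help and the usage text."""
    parser = argparse.ArgumentParser(
        prog="countlines",
        description="Count the lines in a file.",
    )
    parser.add_argument("path", help="file to read")
    return parser


def main(argv=None):
    """Return a different status for each kind of failure."""
    args = build_parser().parse_args(argv)
    try:
        with open(args.path, "rb") as handle:
            count = sum(1 for _ in handle)
    except FileNotFoundError:
        print(f"countlines: {args.path}: no such file", file=sys.stderr)
        return EXIT_NOT_FOUND
    except OSError as err:
        print(f"countlines: {args.path}: {err}", file=sys.stderr)
        return EXIT_UNREADABLE
    print(count)
    return EXIT_OK


# A real script would end with: sys.exit(main())
```

The result goes to stdout, complaints go to stderr, and the exit status alone tells a calling script what happened.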

Recent Failures

I won’t name names, but here are some of the failures I’ve seen recently. (And these are from Big Companies that should know better. Including one company that actually has its own version of Unix… they should really know better.)

  • utility always returns status 1. How the HECK am I supposed to know if it worked? Why are you always returning failure? Didn’t you read a single Unix man page? Didn’t you notice that non-zero exit codes mean failure?
  • -q option suppresses error messages to stdout/stderr… and suppresses the error code return as well. Take a look at diff(1) sometime. The -q option just suppresses the listing of the differences, but still returns the error code.
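To show why this matters, here is a small Python sketch of the calling side. The function names are mine, not from any vendor’s tool. The script throws away all of the utility’s output, which is roughly what a quiet option does, and relies on the exit status alone to decide what to do next.

```python
import subprocess
import sys


def run_utility(argv):
    """Run a command with its output discarded and return its exit status."""
    completed = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


def describe(status):
    """Turn an exit status into the decision a calling script would make."""
    if status == 0:
        return "succeeded"
    return f"failed with status {status}"


# Stand-ins for a vendor utility: tiny Python one-liners that exit 0 or 3.
good = [sys.executable, "-c", "import sys; sys.exit(0)"]
bad = [sys.executable, "-c", "import sys; sys.exit(3)"]

print(describe(run_utility(good)))
print(describe(run_utility(bad)))
```

A utility that always exits with status 1, or that zeroes its status when told to be quiet, leaves a script like this with nothing to go on.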

Strangely enough, these rules apply equally well to Windows command-line utilities. Yes, these do exist.

Another suggestion

If it can be done with a command-line utility, then give us an API we can use.

If you even just create a simple C library, we can then wrap it into our favorite language as a Perl Extension Module, or a Java Native Interface (JNI) package.

If you feel like creating a pure Java implementation of the Library, that would be good too.

From Dan Moore:

On a different tack, but still touching some of the same principles, you may want to check out the Command-Line Options section of The Art of Unix Programming.

Originally Posted June 7, 2004 11:20 AM