You are currently browsing Tom Malaher’s BrainScan weblog archives for June 21, 2004.
Archive for June 21, 2004
Quis Custodiet Ipsos Custodes? Monitoring your monitoring system
June 21, 2004 by Tom Malaher.
The group I’m in here at a major telecommunications provider has a nice little setup of HP’s OpenView and Remedy’s Action Request System.
OpenView listens for and seeks out problems with various internal
and customer-owned systems. When something serious occurs, it uses some
third-party integration software (RemedySPI) to create a trouble ticket
in Remedy. Rules fire to assign the ticket to an appropriate “triage”
person. Notification is then sent to the assignee by email or text
pager, depending on the severity of the problem. There is a POTS modem
plugged into the Remedy box that is used to send text pages, and the
Remedy box is allowed to make outgoing SMTP connections to the Internet
to deliver e-mail.
Remedy and OpenView are linked so that closing the ticket removes
the corresponding event in OpenView, and the OpenView event is
annotated with the Remedy ticket number.
Both use an Oracle database to store information about tickets and events.
Each system (Remedy, OpenView, Oracle) runs on a separate piece of Sun hardware.
So we have three single points of failure that could cause notification
of critical events to stop: OpenView (the software, or the hardware
it’s running on), Remedy (likewise) or Oracle (likewise).
Additionally, failure of a single 100baseT switch or switch port would sever connectivity and take out the notification system.
I’d like to set up something that would detect a failure of one or more of the three critical systems, and notify someone.
Obviously this backup notification system has to be independent of
the three in question. So it should run on some fourth piece of hardware
which is plugged into a different switch. It would then poll or look
for heartbeats from the three systems, and use its own resources to notify
someone of a problem.
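To make the polling idea concrete, here is a minimal sketch in Python of what the fourth box might run. It only checks that a TCP port accepts connections, which says nothing about whether the application behind it is healthy. The host names and ports in `TARGETS` are placeholders I made up for illustration, not the real ones, and the function names are hypothetical.

```python
import socket

# Hypothetical host names and ports, for illustration only.
# Substitute whatever each service actually listens on.
TARGETS = {
    "openview": ("openview.example.internal", 8080),
    "remedy": ("remedy.example.internal", 8081),
    "oracle": ("oracle.example.internal", 1521),
}


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def failed_targets(targets, probe=is_reachable):
    """Return the sorted names of all targets whose probe failed."""
    return sorted(
        name for name, (host, port) in targets.items() if not probe(host, port)
    )
```

Calling `failed_targets(TARGETS)` from a scheduled job would give the list of systems to raise an alarm about. The notification step itself (modem page or e-mail) is deliberately left out, since it depends on how the paging hardware is driven.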
As far as I can see it would be OK if this backup monitoring box
used the same Internet connection, since e-mail is only used to deliver
lower-severity notifications, and a loss of Internet connectivity would
be a Critical-level event, which would use the POTS line to deliver the
notification.
The main monitoring system would then be set up to monitor the backup system, to make sure it’s running.
So… finally, to the question:
What software would you use to set up this monitoring/notification
system? Obviously one could install a completely parallel set of
[OpenView, Oracle, Remedy], but that would be overkill, as we only need
to monitor 3 machines and a few daemon processes.
Is there some nice Open Source project out there that would allow me
to quickly solve this problem? Has anyone done something like this? Any
comments on my failure-mode analysis? Am I worrying about the right
things?
Comments
We use the open source solution: Nagios (http://www.nagios.org/).
It has a simple plugin infrastructure so you can write a bit of Perl code to monitor anything.
It is simple, clean, and very easy to use.
Dion
Posted by: Dion Almaer at June 23, 2004 10:32 AM
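As a rough illustration of the plugin idea mentioned above: a Nagios plugin is any executable that prints a line of status text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below is in Python rather than Perl and checks the age of a heartbeat file. The heartbeat file, the thresholds and the function names are all assumptions made for the example, not anything described in the post.

```python
import os
import time

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def heartbeat_status(age_seconds, warn=120, crit=300):
    """Map the age of the last heartbeat to (exit_code, status_line)."""
    if age_seconds >= crit:
        return CRITICAL, "HEARTBEAT CRITICAL - last beat %d s ago" % age_seconds
    if age_seconds >= warn:
        return WARNING, "HEARTBEAT WARNING - last beat %d s ago" % age_seconds
    return OK, "HEARTBEAT OK - last beat %d s ago" % age_seconds


def main(argv):
    """Check the heartbeat file named in argv[1]; return the exit code."""
    if len(argv) != 2:
        print("HEARTBEAT UNKNOWN - usage: check_heartbeat FILE")
        return UNKNOWN
    try:
        # Assumption: the monitored system touches this file periodically.
        age = time.time() - os.stat(argv[1]).st_mtime
    except OSError as exc:
        print("HEARTBEAT UNKNOWN - %s" % exc)
        return UNKNOWN
    code, message = heartbeat_status(age)
    print(message)
    return code
```

To use it as a real plugin, the script would end by passing the return value of `main(sys.argv)` to `sys.exit`, so that Nagios sees the exit code.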
Three comments:
1. I’d make sure that the secondary monitoring system is monitored
by your fancy-dancy system, in case it fails (of course, if both fail
at once, you’re hosed, unless you want to add a third system, and so
on).
2. I don’t know how complex a system you need to do the secondary
monitoring. If all you want to answer is ‘is the software running?’ you
may be able to get by with a simple Perl script running from cron (can
I make a connection to Oracle, etc). A quick look at SF and Google
didn’t point out anything obviously relevant to more complex cases.
3. How does Remedy send the text messages over the modem? You need
to consider that in your solution, for sure. Not sure how to do that in
Perl.
Posted by: Dan Moore at June 21, 2004 03:33 PM