Service monitoring for OSDT

One of the questions we're working on is how to monitor a network infrastructure that includes DataTurbine. A typical instance might include the following:

  1. Network time server
  2. Data source machine, typically a data acquisition system
  3. Video or image source
  4. DataTurbine
  5. Client machines running RDV
  6. Network infrastructure: routers, DNS server, switches, etc.

Failures of different components have different symptoms, and they can be extremely frustrating to diagnose. Our systems at UCSD are quite complex, and we've had some head-scratchers as we advance the leading edge.

Here are some solutions we've tried so far, with notes as to results.

  1. Smokeping. This is quite good for monitoring connectivity and latency, with excellent graphs that help with intermittent or systematic problems. It sends (by default) 5 ICMP packets every 300 seconds, which is a very low load.
  2. ntop. This is running on niagara, our central server, to measure and report on network loads, connectivity stats and much, much more. Really slick, with amazingly detailed information, but it seems to crash every few hours, at least on our configuration.
  3. Monit and M/Monit. I initially found Monit via freshmeat.net while looking for a fast way to monitor standard TCP/UDP services. Monit fits that bill extremely well: it's a sub-megabyte executable that can check all sorts of things, including files, permissions, system parameters such as CPU, port connectivity and ICMP, and (most interestingly) it can do real checks of protocols like NTP, SMTP, HTTP and more. M/Monit sits on top of Monit, aggregates servers via a Flash interface, and also keeps event logs. Both are excellent and seem to work well. Unfortunately, they are aimed at the sysadmin and lack a read-only display interface. Monit allows you to stop/start/restart services from the web, with potentially awful results. M/Monit does have a read-only user, but you still have to log in and are blocked from the detailed and useful information. So close... The other problem with Monit is that it's deucedly difficult to extend, so testing DataTurbines has to be done via send/expect functionality, which is lame.
  4. Nagios. Plugin-based, knows several network protocols, scriptable, ugly web interface IMHO; very mature but not as slick as Monit.
  5. Inca. This is a local (SDSC) product, and Shava Smallen has written an RBNB query driver that queries the RBNB version and channel list. It is currently monitoring GLEON, and we hope to deploy it locally ASAP.
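
For anyone who wants to experiment outside these tools, here is a minimal Python sketch of the kind of port-connectivity and send/expect check mentioned in the Monit notes above. It is only an illustration, not how Monit, Nagios or Inca implement their checks. The function name check_tcp_service and its parameters are made up for this example, and it does not speak the RBNB protocol; it only confirms that something is listening and, optionally, that a reply contains an expected byte string.

```python
import socket
from typing import Optional


def check_tcp_service(host: str, port: int, timeout: float = 5.0,
                      send: Optional[bytes] = None,
                      expect: Optional[bytes] = None) -> bool:
    """Return True if host:port accepts a TCP connection.

    Hypothetical helper for illustration only.
    If `send` is given, those bytes are written after connecting.
    If `expect` is given, the first reply must contain it to pass.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            if send is not None:
                conn.sendall(send)
            if expect is not None:
                # A single read is enough for short banner-style replies.
                reply = conn.recv(4096)
                return expect in reply
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

A plain connectivity test against a DataTurbine server would then look like check_tcp_service("dataturbine.example.org", 3333), where the hostname is a placeholder and 3333 is assumed to be the port the server listens on; adjust both for your installation. A script like this could be wrapped as a plugin for a plugin-based monitor, but that wiring is not shown here.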

Local instances:

  1. ntop on niagara
  2. monit on iguassu (login required as noted above)
  3. m/monit on iguassu; please use monit/monit to log in.
  4. Inca deployment monitoring GLEON
