Softpanorama
(slightly skeptical) Open Source Software Educational Society

May the source be with you, but remember the KISS principle ;-)

Nagios in Large Enterprise Environment

Nagios was formerly known as Netsaint. It was written by Ethan Galstad approximately ten years ago. It looks like the author had no previous experience with commercial monitoring systems.

Functionally, Nagios is not much more than a simple daemon that schedules probes. It is written in C. It is generally an agentless solution (like SiteScope), but its support for SSH and telnet is very basic. Initially it was designed to monitor the host it was running on plus network services. In a way, Nagios is still most suitable for network monitoring or for monitoring a limited number of servers via SSH (or rsh or telnet).

By default, the Nagios plugins that perform various checks and measurements run on the Nagios server, not on the monitored resources (they are scheduled to run on localhost). Later Nagios added the so-called Nagios Remote Plug-in Executor (NRPE), but in terms of architecture and functionality it is not competitive with well-engineered agents.

Plugins can generate messages, which are forwarded to the Web interface, SMTP mail and similar services. Attempts to extend Nagios to monitoring multiple servers are very clumsy. There is nothing like the concept of an access path to a host or a host access method. The facilities provided (check_by_ssh and NRPE) both smell like cheap hacks. For example, there is no way to associate a host or a host group with a particular access method for probes in the Nagios configuration. For problems in using Nagios in a large enterprise environment, see Deploying Nagios in a Large Enterprise Environment, at USENIX LISA '07.

For more application oriented monitoring system see OpenSMART by Ulrich Herbst

Nagios imposes very few limitations on the structure and communication of probes (plug-ins in Nagios terminology). They can be any legitimate Unix executable written in any language (shell, Perl, C, etc.). After execution, Nagios grabs the first line of text from STDOUT. It also captures the plug-in return code:

Numeric Value | Service Status | Status Description
0 | OK | The plugin was able to check the service and it appeared to be functioning properly
1 | Warning | The plugin was able to check the service, but it appeared to be above some "warning" threshold or did not appear to be working properly
2 | Critical | The plugin detected that either the service was not running or it was above some "critical" threshold
3 | Unknown | Invalid command line arguments were supplied to the plugin or low-level failures internal to the plugin (such as unable to fork, or open a tcp socket) that prevent it from performing the specified operation. Higher-level errors (such as name resolution errors, socket timeouts, etc) are outside of the control of plugins and should generally NOT be reported as UNKNOWN states.
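
The return-code convention can be illustrated with a minimal sketch of a threshold probe. The function name check_threshold and its three integer arguments are assumptions made for this illustration, not part of any shipped Nagios plugin; only the exit codes 0 to 3 and the "first line of STDOUT" rule come from the description above.

```shell
# Hypothetical helper: map a measured integer onto the Nagios return codes.
# Usage: check_threshold VALUE WARN CRIT   (higher values are worse)
check_threshold() {
    # Wrong argument count is a usage error: UNKNOWN (3).
    if [ "$#" -ne 3 ]; then
        echo "UNKNOWN - usage: check_threshold VALUE WARN CRIT"
        return 3
    fi
    # Empty or non-numeric arguments are also reported as UNKNOWN (3).
    for arg in "$1" "$2" "$3"; do
        case "$arg" in
            ''|*[!0-9]*)
                echo "UNKNOWN - arguments must be non-negative integers"
                return 3 ;;
        esac
    done
    value=$1; warn=$2; crit=$3
    # The single line printed here is what Nagios would pick up from STDOUT.
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value is $value (threshold $crit)"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value is $value (threshold $warn)"
        return 1
    fi
    echo "OK - value is $value"
    return 0
}
```

A real plug-in script built on this sketch would end with something like check_threshold "$@" followed by exit $?, so that the function's return value becomes the process exit status that Nagios reads.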

Parameters are communicated to probes via the Nagios environment, which consists of multiple macros. Each macro is essentially a variable that is populated by Nagios and whose value is inserted into the probe invocation string, for example:

define command{
command_name    check_ping
command_line    /usr/local/nagios/libexec/check_ping -H $HOSTADDRESS$ -w 100.0,90% -c 200.0,60%
}

Here $HOSTADDRESS$ is a macro populated by Nagios. Actually, starting with Nagios 2.0, most macros have been made available as environment variables.
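
As a rough sketch of how a probe might use this, the function below takes the target address from its first argument and falls back to the environment. It assumes that the $HOSTADDRESS$ macro is exported under the name NAGIOS_HOSTADDRESS (the macro name with a NAGIOS_ prefix); the function name target_host is hypothetical.

```shell
# Hypothetical helper: decide which host a probe should talk to.
# Prefers an explicit argument; otherwise uses the environment variable
# that Nagios is assumed to export for the $HOSTADDRESS$ macro.
target_host() {
    if [ -n "${1:-}" ]; then
        echo "$1"
    elif [ -n "${NAGIOS_HOSTADDRESS:-}" ]; then
        echo "$NAGIOS_HOSTADDRESS"
    else
        # No address available: report UNKNOWN in plug-in style.
        echo "UNKNOWN - no host address given"
        return 3
    fi
}
```

With this approach the command_line in the configuration can stay short, at the cost of making the probe depend on being started by Nagios.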

The ability to use services like SSH and telnet for communication with remote hosts is not built into the system, and you need to specify each such probe in the configuration file. Another possibility is to write a custom probe envelope in some scripting language, for example Perl, that detects which host should be reached by which method and then runs the probe accordingly. In general, for a non-trivial enterprise network you will probably need to program your own probe envelope or hire somebody to do this.
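
A probe envelope of this kind could look roughly like the following sketch, written in shell rather than Perl. The host name patterns (web*, db*), the function names, and the PLUGIN_DIR variable are all invented for the illustration; the sketch only prints the command line it would run, using the -H/-C options of check_by_ssh and the -H/-c options of check_nrpe.

```shell
# Hypothetical table of access methods, keyed on host name patterns.
access_method() {
    case "$1" in
        web*) echo ssh ;;
        db*)  echo nrpe ;;
        *)    echo local ;;
    esac
}

# Print the command line the envelope would run for HOST and PROBE.
probe_command() {
    host=$1; probe=$2
    # Plug-in directory; the default is the path used elsewhere in this page.
    plugins=${PLUGIN_DIR:-/usr/local/nagios/libexec}
    case "$(access_method "$host")" in
        ssh)  echo "$plugins/check_by_ssh -H $host -C $plugins/$probe" ;;
        nrpe) echo "$plugins/check_nrpe -H $host -c $probe" ;;
        *)    echo "$plugins/$probe" ;;
    esac
}
```

Keeping the host-to-method mapping in one function means the Nagios configuration can call the same envelope for every host, which is exactly the association that the configuration language itself does not offer.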

Configuration of Nagios is rather verbose and should generally be generated with a macro generator like M4, or you will hate Nagios from day one :-).
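
The same idea, generating repetitive definitions from a short list, can be sketched in plain shell instead of M4. The function names are hypothetical, and the default template name normal is an assumption (it matches the host template shown later on this page).

```shell
# Print one host definition; the third argument (template) defaults to "normal".
gen_host() {
    printf 'define host{\n'
    printf '   use        %s\n' "${3:-normal}"
    printf '   host_name  %s\n' "$1"
    printf '   address    %s\n' "$2"
    printf '}\n'
}

# Read "name address" pairs from standard input, one per line,
# and print a host definition for each non-empty line.
gen_hosts() {
    while read -r name address; do
        if [ -n "$name" ]; then
            gen_host "$name" "$address"
        fi
    done
}
```

For example, gen_hosts < hosts.txt > hosts.cfg would turn a two-column inventory file (a hypothetical hosts.txt) into an object configuration file.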

The structure of events is very primitive. Structured fields are essentially limited to the name of the host and the severity. Everything else needs to be interpreted from the text of the message.

The only interesting architectural feature of Nagios that I have found is the concept of adaptive monitoring. Adaptive monitoring allows a Nagios configuration to be changed at runtime. Currently it is limited to the ability to change the interval between checks and the times during which checks are scheduled to occur. This allows you to turn checks on or off at specific times according to conditions in your environment. Not a very impressive capability, but a step in the right direction.

Nagios has the ability to notify contacts (via email, pager or other methods) when problems arise and are resolved. This is handled by communication modules, some of which (for SMTP) are provided in the distribution.

A Web interface is provided. The information that Nagios collects is displayed in a set of automatically updated Web pages. Several CGIs are included in order to allow you to view the current and historical status via a Web browser. A WAP interface is also provided to allow you to acknowledge problems and disable notifications from an internet-ready cellphone. The narrow column on the left of the display lists links to all of the possible Nagios web pages (the one for the current page has been highlighted in the illustration).

The Tactical Overview page shows general statistics about the overall status of the monitored infrastructure, such as the number of hosts that are down, unreachable, etc. The display also indicates the number of services in "critical" status (probably indicating a failure), as well as in other states. Each of the problem indicators also functions as a link to another Web page giving details about that particular item.

Nagios prepackaged modules (see Configuring Nagios Commands) can monitor a pretty wide variety of system properties, including standard system performance metrics such as load average and free disk space; the presence of important services like HTTP and SMTP; and host network availability and reachability. Nagios also allows the system administrator to define what constitutes a significant event on each host--for example, how high a load average is "too high"--and what to do when such conditions are detected.

In addition to detecting problems with hosts and their important services, Nagios also allows the system administrator to specify what should be done as a result. A problem can trigger an alert to be sent to a designated recipient via various communication mechanisms (such as email, Unix message, pager). It is also possible to define an event handler: a program that is run when a problem is detected. Such programs can attempt to solve the problem encountered, and they can also proactively prevent some serious problems when they get triggered by warning conditions.

Available actions in the Nagios Host Information display

Item | Meaning
Disable checks of this host | Stop monitoring this host for availability.
Acknowledge this host problem | Respond to a current problem (discussed below).
Disable notifications for this host | Don't send alerts if this host is unavailable.
Delay next host notification | Delay the next alert for host unavailability.
Schedule downtime for this host / Cancel scheduled downtime for this host | Define or cancel scheduled downtime. During downtime, host unavailability is not considered a problem.
Disable notifications for all services on this host / Enable notifications for all services on this host | Don't/do send alerts if a service on this host fails.
Schedule an immediate check of all services on this host | Check all services as soon as possible (rather than waiting for their next scheduled time).
Disable checks of all services on this host / Enable checks of all services on this host | Disable or enable checking service health on this host.
Disable event handler for this host | Prevent the event handler from running when a problem is detected on this host.
Disable flap detection for this host | Don't try to detect flaps (rapid up-down or on-off oscillations) on this host or its services.

The second menu item allows you to acknowledge any current problem. Acknowledging simply means "I know about the problem, and it is being handled." Nagios marks the corresponding event as such, and future alerts are suppressed until the item returns to its normal state. This process also allows you to enter a comment explaining the situation, an action that is helpful when more than one administrator regularly examines the monitoring data.

If you don't like all of these table-oriented status displays, Nagios also has the capability to use graphical ones. See the Nagios Web site for example screen shots.

Configuring Nagios

Configuring Nagios is a punishment: it is time consuming and boring. Sometimes questions arise about the designers' inclination toward red tape. Some of those deficiencies are connected with the fact that the system was programmed in C. Nagios uses half a dozen configuration files. Among them are nagios.cfg, the object configuration files, cgi.cfg, and resource.cfg, all discussed below.

The package provides sample starter versions of all of these files. We will consider some aspects of these file types in the remainder of this article.

Nagios configuration files are generally stored in /usr/local/nagios/etc

The nagios.cfg File

This configuration file contains directives that apply to the entire Nagios monitoring system. Here is an annotated sample version illustrating some of its most important features:

# File locations
log_file=/var/log/nagios.log
cfg_file=/etc/opt/nagios/checkcommands.cfg
cfg_file=/etc/opt/nagios/misccommands.cfg
cfg_file=/etc/opt/nagios/hosts.cfg
resource_file=/etc/opt/nagios/resource.cfg
lock_file=/var/run/nagios.lock
... 

The first part of the configuration file specifies various file locations, including the general log file and the files holding service check command definitions and notification and event handler command definitions (checkcommands and misccommands). Other cfg_file directives are used by the administrator to specify the object definition files in use at that site (here, hosts.cfg). Locations for other types of files follow. The lock file holds the PID of the current nagios process.

# Logging settings
log_rotation_method=d 
log_archive_path=/var/log/nagios
use_syslog=1
log_host_retries=1
log_event_handlers=1
...

These directives specify logging settings, including how often logs are rotated (here, daily), the archive directory for old files, whether to log significant problems to syslog as well, and whether to log individual event types.

# Global settings
nagios_user=nagios
nagios_group=nagios
date_format=us
admin_email=nagadmin
admin_pager=19995551212

These lines specify various global settings, including the user/group as which the nagios daemon runs, the output format for dates (here, US style), and the administrator's email address. The final item sets the value of the $ADMINPAGER$ macro, which can be used in command definitions.

# Package-wide event handlers
enable_event_handlers=1
global_host_event_handler=global-event-command
global_service_event_handler=global-svc-command

Settings related to event handlers. You can optionally define a single event handler for all host failures and service failures in this file if appropriate. Commands are defined in an object configuration file.

# Concurrent checks and time-outs
max_concurrent_checks=0
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
...

These directives control the maximum number of checks that can be run at the same time (0 means an unlimited number), as well as time-outs for various types of commands (values in seconds).

# Retained status information
retain_state_information=1
retention_update_interval=60
use_retained_program_state=1

These lines tell Nagios to retain information about host and service status between sessions, saving the values every 60 seconds, and reloading them when the facility starts up.

# Passive service checks
accept_passive_service_checks=1
check_service_freshness=1

These directives enable "passive checks": status data produced by external commands which Nagios imports periodically.

# Save Nagios data for later use
process_performance_data=1
host_perfdata_command=process-host-perfdata
service_perfdata_command=process-service-perfdata

These directives allow you to save Nagios data externally for long term analysis or other purposes. The commands specified here must be defined in some object configuration file. The simplest such command simply writes the command's output to an external file: e.g., echo $OUTPUT$ >> file, but you can perform whatever action is appropriate (e.g., send the data to an RRDTool or other database).

Note that the directives appear in a slightly different order in the sample nagios.cfg file provided with the package.

Object Configuration Files

The bulk of Nagios configuration occurs in the object configuration files. These files define hosts and services to be monitored, how various status conditions should be interpreted, and what actions should be taken when they occur. The object types discussed below are hosts and host groups, services, contacts and contact groups, time periods, and commands.

Objects of these types will need to be defined for virtually every Nagios installation; other object types are optional. In the sample Nagios configuration provided with the package, each type of object is defined in a separate configuration file (named after the object type, excluding any spaces). However, you can arrange your definitions in any form that makes sense to you.

Hosts and Host Groups

All of these items are defined via templates: named sets of attributes and settings that can be easily applied to any number of actual objects. For example, here is a template definition for hosts:

define host{
; Template name
   name                      normal  
; This is only a template (not a real host)
   register                       0 

; Host notifications are enabled
   notifications_enabled          1  
; Command to check if host is available
   check_command   check-host-alive  
; Recheck failures this many times
   max_check_attempts             1
; Repeat failure notifications every 2 hours
   notification_interval        120  
; When to send notifications (time period name)
   notification_period         24x7  
; Notify when down, unreachable and on recovery
   notification_options       d,u,r  
; Host event handler is enabled
   event_handler_enabled          1  
; Event handler command (defined elsewhere)
   event_handler            host-eh  
; Flap detection is disabled
   flap_detection_enabled         0  
; Save performance data
   process_perf_data              1 
; Save status information across restarts 
   retain_status_information      1  
}

This template defines a variety of host-monitoring settings (which are explained in the comments following the semicolons). Here is a host definition that uses this template:

define host{
; Template on which to base host
   use                        normal  
; Note the attribute is not "name" as above
   host_name                  beulah  
; Longer description
   alias            beulah: SuSE 8.1  
; IP address
   address              192.168.1.44  
; Overrides template value
   max_check_attempts              8  
}

Other hosts may be defined in a similar way. Host definitions themselves can also be used as templates, provided that a name attribute is included.

Once hosts have been defined, they may be placed into host groups via directives like this one:

define hostgroup{
   hostgroup_name      bldg2
   alias               Building 2
   contact_groups      admins1
   members             beulah,callisto,ariadne,leah,lovelace,valley
}

This definition creates the host group named bldg2, consisting of six hosts (all previously defined via define host directives). The contact_groups attribute specifies who to send notifications to, and it is defined elsewhere (as we'll see).

You can use as many host groups as you want to. Hosts can be part of multiple host groups, and host groups themselves may be nested.

Services

Here are two service templates and a service definition:

define service{  ; Define defaults for all services
   name                  generic
   register                    0
; Check service every 30 minutes
   normal_check_interval      30  
; Retry failing checks every 3 minutes, up to 5 times
   retry_check_interval        3  
   max_check_attempts          5
   event_handler_enabled       1
   check_period             24x7
; Repeat notifications for failures every 2 hours
   notification_interval     120  
   notification_period     6to22
; Notify contacts about critical failures/recoveries
   notification_options      c,r  
   notifications_enabled       1
   contact_groups         admins
} 

define service{  ; Define the SMTP service
   use                    generic
   name              generic-SMTP
   register                     0

   service_description Check SMTP
   check_command       check_smtp
   event_handler          eh_smtp
   contact_groups      mailadmins
}

define service{  ; Define services to be monitored
   use               generic-SMTP
; Monitor SMTP for all hosts in this host group
   host_groups          mailhosts  
}

The first template (generic) defines some settings, which can be applied to a variety of service types. The second template, generic-SMTP, uses the first template as a starting point and adds to them in order to create a generic SMTP monitoring service. Specifically, it defines a check command, an event handler, and a contact group that are appropriate for the SMTP service. The final define service stanza sets up SMTP monitoring for all of the hosts in the mailhosts host group.

Contacts and Contact Groups

Here are two stanzas defining a contact and a contact group:

define contact{
   contact_name                    nagadmin
   alias                           Nagios Admin
; When to notify about service problems
   service_notification_period     6to22  
; When to notify about host problems
   host_notification_period        24x7   
; Notify on critical problems and recoveries
   service_notification_options    c,r    
; Notify on host down and recoveries
   host_notification_options       d,r    
   service_notification_commands   notify-by-email
   host_notification_commands      host-notify-by-epager
   email                           nagios-admins@ahania.com
   pager                           $ADMINPAGER$
}

define contactgroup{
   contactgroup_name               mailadmins
   alias                           Mail Admins
   members                         mailadm,chavez,catfemme
}

The first stanza defines a contact named nagadmin. It also defines what events to notify this contact about and the time periods during which notifications should be sent. The commands to use to generate the alerts are also specified, along with arguments to them (see below).

Time Periods

Time period definitions are quite simple. Here are the definitions of the two time periods we have used so far:

define timeperiod{
   timeperiod_name 24x7
   alias           24 Hours A Day, 7 Days A Week
   sunday          00:00-24:00
   monday          00:00-24:00
   tuesday         00:00-24:00
   wednesday       00:00-24:00
   thursday        00:00-24:00
   friday          00:00-24:00
   saturday        00:00-24:00
}

define timeperiod{
   timeperiod_name 6to22
   alias           Weekdays, 6 AM to 10 PM
   monday          06:00-22:00
   tuesday         06:00-22:00
   wednesday       06:00-22:00
   thursday        06:00-22:00
   friday          06:00-22:00
}

Note that only the applicable days need be included in the definition.

Commands

The commands referred to in many of the preceding object definitions also must be defined. For example, here is the SMTP service check command definition:

define command{
   command_name  check_smtp
   command_line  $USER1$/check_smtp -H $HOSTADDRESS$
}

This command runs the check_smtp script stored in the directory defined in the macro $USER1$ (defined in the resource.cfg file--see below); this macro conventionally holds the path to the Nagios plug-ins directory. The command is passed the option -H, followed by the IP address of the host to be checked (the latter is expanded from the built-in $HOSTADDRESS$ macro).

You can determine the syntax for any plug-in by running it with the --help option. You can also extend Nagios by adding custom plug-ins of your own. See the documentation for details on how to accomplish this.

Event handlers are defined in the same way, as in this example:

define command{
   command_name  eh_smtp
   command_line  /usr/local/nagios/eh/fix_mail $HOSTADDRESS$ $STATETYPE$
}

Here, we define the command named eh_smtp. It specifies the full path to a program to run, passing two arguments: the host's IP address and the value of the $STATETYPE$ macro. This macro is set to SOFT while a failing check is still being retried and to HARD once the check has failed the configured maximum number of attempts (max_check_attempts).
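
The body of such an event handler might be organized along the lines of the sketch below. The function name and the messages are invented for the illustration, and the actual repair step is left as a placeholder message because it is site specific. The sketch follows the common practice of waiting while a problem is still in a SOFT state (checks are being retried) and acting once it becomes HARD.

```shell
# Hypothetical core of an event handler such as fix_mail.
# Arguments: HOST_ADDRESS STATE_TYPE (the two values passed by eh_smtp).
handle_smtp_event() {
    host=$1; state_type=$2
    case "$state_type" in
        HARD)
            # Retries are exhausted: this is where a repair would be attempted.
            echo "HARD state on $host: attempting repair"
            ;;
        SOFT)
            # Nagios is still retrying the check: do nothing yet.
            echo "SOFT state on $host: waiting for retries"
            ;;
        *)
            echo "unexpected state type: $state_type" >&2
            return 1
            ;;
    esac
}
```

A script installed as /usr/local/nagios/eh/fix_mail could simply call handle_smtp_event "$1" "$2".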

Here are the definitions of commands used for notifications (we've wrapped the command_line setting for clarity):

define command{
   command_name  notify-by-email
   command_line  /usr/bin/printf "%b" "***** Nagios 1.0 *****\n\n
                 Notification Type: $NOTIFICATIONTYPE$\n\n
                 Service: $SERVICEDESC$\n
                 Host: $HOSTALIAS$\n
                 Address: $HOSTADDRESS$\n
                 State: $SERVICESTATE$\n\n
                 Date/Time: $DATETIME$\n\n
                 Additional Info:\n\n$OUTPUT$" | 
     /usr/bin/mail -s "** $NOTIFICATIONTYPE$
     alert - $HOSTALIAS$/$SERVICEDESC$ 
                 is $SERVICESTATE$ **" $CONTACTEMAIL$
}

This command constructs a simple email message using the printf command and many built-in Nagios macros. It then sends the message using the mail command, specifying the recipient as the $CONTACTEMAIL$ macro. The latter contains the value of the email attribute of the contact being notified.

The cgi.cfg File

The cgi.cfg configuration file has several different functions within the Nagios system. Among the most important is authentication, allowing access to Nagios and its data to be restricted to appropriate people. Here are some sample directives related to authorization:

use_authentication=1
authorized_for_configuration_information=netsaintadmin,root,chavez
authorized_for_all_services=netsaintadmin,root,chavez,maresca

The first entry enables the access control mechanism. The next two entries specify users who are allowed to view Nagios configuration information and services status information (respectively). Note that all users also must be authenticated to the Web server using the usual Apache htpasswd mechanism.

This same configuration file is also used to store settings for icon-based status displays, as in these examples:

hostextinfo[janine]=;redhat.gif;;redhat.gd2;;168,36;,,;
hostextinfo[ishtar]=;apple.gif;;apple.gd2;;125,36;,,;

These entries specify extended attributes for the hosts defined in the entries labeled janine and ishtar. The filenames in this example specify image files for the host in status tables (GIF format--see Figure 3) and in the status map (GD2 format), and the two numeric values specify the device's location--for example, x and y coordinates--within the 2D status map. (Figure 4 provides an example status map display.)

The resource.cfg File

The final configuration file we will consider is the resource.cfg file. It is used to define site-specific macros, strangely named $USER1$ through $USER32$:

# $USER1$ = path to plugins directory
$USER1$=/usr/lib/nagiosplugins    
...

# Store a username and password (hidden)
$USER3$=admin
$USER4$=somepassword

The first macro defines the path to the Nagios plug-ins directory; this usage is assumed by the supplied sample configuration files.

The other two macros are used in this case to store a username and password. These items can be used in command definitions for added security. The resource.cfg file itself can be protected against all non-root access without compromising the ability of CGI programs to run successfully.

Checking a Nagios Configuration

Since Nagios configuration is somewhat involved, the package provides a command that can be used to verify it prior to running the program. Here is an example of its use:

# cd /etc/opt/nagios/nagios/etc
# /usr/local/nagios/bin/nagios -v nagios.cfg

This will check the Nagios configuration, which uses nagios.cfg as its main configuration file.
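
A common way to use this check is as a gate in front of a restart or reload. The sketch below assumes that nagios -v exits with a non-zero status when it finds configuration errors; the function name verify_then and the way the follow-up command is passed are illustrative choices, not part of Nagios.

```shell
# Hypothetical helper: run a command only if the configuration verifies.
# Usage: verify_then NAGIOS_BINARY MAIN_CONFIG COMMAND [ARGS...]
verify_then() {
    nagios_bin=$1; cfg=$2
    shift 2
    # Discard the verbose report; only the exit status matters here.
    if "$nagios_bin" -v "$cfg" >/dev/null 2>&1; then
        "$@"
    else
        echo "configuration check failed; not running: $*" >&2
        return 1
    fi
}
```

For example, verify_then /usr/local/nagios/bin/nagios nagios.cfg followed by whatever restart command is used at your site would refuse to restart the daemon on a broken configuration.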

External Command Pipe

Using external commands in Nagios, by Wojciech Kocjan

November 20, 2008 | Linux.com

System monitoring tool Nagios offers a powerful mechanism for receiving events and commands from external applications. External commands are usually sent from event handlers or from the Nagios Web interface. You will find external commands most useful when writing event handlers for your system, or when writing an external application that interacts with Nagios.

This article is excerpted from the newly published book Learning Nagios 3.0 from Packt Publishing.

The external commands pipe is a pipe file created on a filesystem that Nagios uses to receive incoming messages. The communication does not use any authentication or authorization -- the only requirement is to have write access to the pipe file, rw/nagios.cmd, which is located in the directory passed as the localstatedir option during compilation.

An external command file is usually writable by the owner and the group; the usual group used is nagioscmd. If you want a user to be able to send commands to the Nagios daemon, simply add that user to this group.

A small limitation of the command pipe is that there is no way to get any results back, so it is not possible to send any query commands to Nagios. Therefore, by just using the command pipe, you have no verification that the command you have passed to Nagios has been processed, or will be processed soon. It is, however, possible to read the Nagios log file and check whether it indicates that the command has been parsed correctly.
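
Checking the log can be scripted. The sketch below assumes that Nagios records each accepted command on a log line containing the text "EXTERNAL COMMAND:" followed by the command name (this depends on the log_external_commands setting); the function name command_logged is hypothetical.

```shell
# Hypothetical helper: succeed if LOGFILE contains an entry for COMMAND_NAME.
# Usage: command_logged LOGFILE COMMAND_NAME
command_logged() {
    # The trailing ';' avoids matching commands that merely share a prefix.
    grep -q "EXTERNAL COMMAND: $2;" "$1"
}
```

After writing a command to the pipe, a script could wait a few seconds and then call command_logged /var/log/nagios.log ADD_HOST_COMMENT (using the log_file path from the sample nagios.cfg above) to see whether the command was parsed.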

The Nagios Web interface uses an external command pipe to control how Nagios works. The Web interface does not use any other means to send commands or apply changes to Nagios.

From the Nagios daemon perspective, there is no clear distinction as to who can perform what operations. Therefore, if you plan to use the external command pipe to allow users to submit commands remotely, you need to make sure that authorization is in place so that unauthorized users cannot send potentially dangerous commands to Nagios.

The syntax for formatting commands is easy. Each command must be placed on a single line and end with a newline character. The syntax is as follows:

[TIMESTAMP] COMMAND_NAME;argument1;argument2;...;argumentN
 

TIMESTAMP is written as Unix time -- that is, the number of seconds since 1970-01-01 00:00:00. You can create this by using the date command. Most programming languages also offer the means to get the current Unix time.

Commands are written in upper case. The arguments depend on the actual command. For example, to add a comment to a host stating that it has passed a security audit, you can use the following shell command:

 
echo "[$(date +%s)] ADD_HOST_COMMENT;somehost;1;Security Audit;This host has passed security audit on $(date +%Y-%m-%d)" >/var/nagios/rw/nagios.cmd
 

This will send an ADD_HOST_COMMENT command to Nagios over the external command pipe. Nagios will then add a comment to the host, somehost, stating that the comment originated from Security Audit. The first argument specifies the host name to add the comment to; the second tells Nagios if this comment should be persistent. The next argument describes the author of the comment, and the last argument specifies the actual comment text.
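
Because the timestamp and the semicolon-separated arguments are easy to get wrong by hand, it can help to wrap the formatting in small functions. This is only a sketch: the names format_command and submit_command are invented here, and the pipe path is passed in by the caller rather than assumed.

```shell
# Hypothetical helper: print one external command line.
# Usage: format_command COMMAND_NAME [ARG...]
format_command() {
    name=$1
    shift
    # Current Unix time in brackets, then the command name.
    line="[$(date +%s)] $name"
    # Append each argument, separated by semicolons.
    for arg in "$@"; do
        line="$line;$arg"
    done
    printf '%s\n' "$line"
}

# Hypothetical helper: write one command to the given pipe file.
# Usage: submit_command PIPE COMMAND_NAME [ARG...]
submit_command() {
    pipe=$1
    shift
    format_command "$@" > "$pipe"
}
```

With these in place, the comment above could be added with submit_command /var/nagios/rw/nagios.cmd ADD_HOST_COMMENT somehost 1 "Security Audit" "This host has passed security audit".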

Similarly, adding a comment to a service requires the use of the ADD_SVC_COMMENT command. The command's syntax is similar to that of the ADD_HOST_COMMENT command except that the command requires the specification of the host name and service name.

You can also delete a single comment or all comments using the DEL_HOST_COMMENT, DEL_ALL_HOST_COMMENTS, and DEL_SVC_COMMENT or DEL_ALL_SVC_COMMENTS commands.

Other commands worth mentioning are related to scheduling checks on demand. Often, it is necessary to request that a check be carried out as soon as possible; for example, when testing a solution.

You can create a script that schedules a check of a host, all services on that host, and a service on a different host, as follows:

 
#!/bin/sh
# Take one timestamp and reuse it for all three commands.
NOW=$(date +%s)
echo "[$NOW] SCHEDULE_HOST_CHECK;somehost;$NOW" \
  >/var/nagios/rw/nagios.cmd
echo "[$NOW] SCHEDULE_HOST_SVC_CHECKS;somehost;$NOW" \
  >/var/nagios/rw/nagios.cmd
echo "[$NOW] SCHEDULE_SVC_CHECK;otherhost;Service Name;$NOW" \
  >/var/nagios/rw/nagios.cmd
exit 0
 

The commands SCHEDULE_HOST_CHECK and SCHEDULE_HOST_SVC_CHECKS accept a host name and the time at which the check should be scheduled. The SCHEDULE_SVC_CHECK command requires the specification of a service description as well as the name of the host to schedule the check on.

Normal scheduled checks, such as the ones scheduled above, might not actually take place at the time that you scheduled them. Nagios also needs to take allowed time periods into account as well as checking whether checks were disabled for a particular object or globally for the entire Nagios.

There are cases when you'll need to force Nagios to do a check -- in such cases, you should use the SCHEDULE_FORCED_HOST_CHECK, SCHEDULE_FORCED_HOST_SVC_CHECKS, and SCHEDULE_FORCED_SVC_CHECK commands. They work in exactly the same way as described above, but make Nagios skip the checking of time periods and ignore whether checks are disabled for this particular object. This way, a check will always be performed, regardless of other Nagios parameters.

Other commands worth using are related to custom variables, introduced in Nagios 3. When you define a custom variable for a host, service, or contact, you can change its value on the fly with the external command pipe.

As these variables can then be directly used by check or notification commands and event handlers, it is possible to make other applications or event handlers change these attributes directly without modifications to the configuration files.

How might this work? Suppose that the IT staff registers its presence via an application without any GUI. This application periodically sends information about the latest known IP address, and that information is then passed to Nagios assuming that the person is in the office. This would later be sent to a notification command to use that specific IP address while sending a message to the user.

Assuming that the user name is jdoe and the custom variable name is DESKTOPIP, the message that would be sent to the Nagios external command pipe would be as follows:

 
[1206096000] CHANGE_CUSTOM_CONTACT_VAR;jdoe;DESKTOPIP;12.34.56.78
 

This would cause a subsequent use of $_CONTACTDESKTOPIP$ to return a value of 12.34.56.78.
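
A registration application could produce this message with a few lines of shell. In the sketch below, the function names and the optional timestamp argument are assumptions made for the illustration; the command name and the macro naming pattern are the ones shown above.

```shell
# Hypothetical helper: print a CHANGE_CUSTOM_CONTACT_VAR command line.
# Usage: change_contact_var CONTACT VARNAME VALUE [TIMESTAMP]
change_contact_var() {
    contact=$1; var=$2; value=$3
    # Use the given timestamp if present, otherwise the current Unix time.
    ts=${4:-$(date +%s)}
    printf '[%s] CHANGE_CUSTOM_CONTACT_VAR;%s;%s;%s\n' \
        "$ts" "$contact" "$var" "$value"
}

# Hypothetical helper: print the macro name under which a contact
# custom variable is referenced in command definitions.
contact_macro() {
    printf '$_CONTACT%s$\n' "$1"
}
```

Redirecting the output of change_contact_var jdoe DESKTOPIP 12.34.56.78 into the command pipe would submit the update; contact_macro DESKTOPIP prints the macro name to use in a notification command.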

Nagios offers the CHANGE_CUSTOM_CONTACT_VAR, CHANGE_CUSTOM_HOST_VAR, and CHANGE_CUSTOM_SVC_VAR commands for modifying custom variables in contacts, hosts, and services.

The commands explained above are just a small subset of the full capabilities of the Nagios external command pipe. For a complete list of commands, visit the External Command List.

Using SSH with Nagios

Remotely monitor servers with the Nagios check_by_ssh plugin  by Vincent Danen

January 20th, 2009
If you are using Nagios to monitor remote servers, you have more than one method to execute checks, including the use of the check_by_ssh plugin. Vincent Danen tells you how to set up this plugin and the best way to secure it.

—————————————————————————————————

Nagios is a monitoring system that can be used to monitor a wide variety of services and criteria. Remotely, it can monitor anything that can be accessed remotely: Web sites, SMTP servers, FTP servers, and so forth. Locally, it can monitor even more: load average, swap and memory usage, disk space usage, hard drive temperatures, and the like. In fact, Nagios’ extensible nature makes writing plugins a breeze, so it is possible to monitor anything for which you are able to get representable data.

Unfortunately, if you wish to monitor local resource usage on a remote site it can be a little trickier. There are a number of ways this can be done, from using NSCA (Nagios Service Check Acceptor) to using NRPE (Nagios Remote Plugin Executor). These solutions may be best if you are able to compile and install software on the other machine, but if that is not a possibility, there are other solutions.

One such solution is to execute checks via SSH. If you are able to access the remote machine via SSH and have the ability to run programs out of a home directory, and the ability to set an SSH public key, then the check_by_ssh plugin is perhaps your best bet.

The first step is to ensure that the central Nagios server is able to connect to the remote host via SSH in a manner that does not require a password. This would require creating a password-less public/private keypair as the user running the Nagios service (typically “nagios”), sending the public key to the remote server, and then (as user “nagios”) logging into the remote system. For example:

nagios@nagiosserver:~/ > $ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/nagios/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/nagios/.ssh/id_dsa.
Your public key has been saved in /home/nagios/.ssh/id_dsa.pub.
The key fingerprint is:
6a:b4:cb:f1:7d:7b:7c:1b:c4:79:2a:5d:5a:16:da:b8 nagios@nagiosserver.com
nagios@nagiosserver:~/ > $ scp .ssh/id_dsa.pub user@remotehost.com:~/.ssh/authorized_keys
nagios@nagiosserver:~/ > $ ssh user@remotehost.com
user@remotehost:~/ > $

This creates the key without a passphrase (press Enter at both passphrase prompts) and then copies the newly-created id_dsa.pub public key file to the remote host. Note that the scp command shown overwrites any existing ~/.ssh/authorized_keys on the remote host; if that file already holds keys, append the new key instead. Make sure that the ~user/.ssh directory already exists on the remote host and ensure that it is mode 0700 to protect it. If that is all correct, then using ssh to connect to the remote site as the specified user should yield a shell prompt. If so, then we can configure Nagios to use check_by_ssh.

One quick note: if you are able to create a dedicated account on the remote system for this, it would be best to do so. If, on the other hand, you are unable to, be sure to adequately protect your central Nagios server, because if anyone can obtain privileges as “nagios” on the central server, they will have an easy ticket to your user account on the remote server.

The next step is to copy whichever plugins you wish to execute on the remote machine into a ~/bin or ~/plugins directory there. To step up security, you can write a wrapper script that executes only those specific commands and modify ~/.ssh/authorized_keys on the remote server so that the key can only run the wrapper script, which prevents that key from being used for anything other than executing Nagios checks.
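
The article does not show such a wrapper. The following Python sketch is one possible shape for it, relying on the fact that sshd places the command requested by the client in the SSH_ORIGINAL_COMMAND environment variable when a forced command is configured. The plugin directory, the check_ name prefix rule, and the function names are assumptions made for illustration.

```python
import os
import shlex
import sys

# Hypothetical directory holding the plugins copied to the remote account.
PLUGIN_DIR = "/home/user/bin"


def parse_allowed(original_command, plugin_dir=PLUGIN_DIR):
    """Return the argv to run, or None if the request is not a plugin in plugin_dir."""
    try:
        argv = shlex.split(original_command)
    except ValueError:
        return None
    if not argv:
        return None
    # Normalise the path so that ".." tricks cannot escape the plugin directory.
    program = os.path.normpath(argv[0])
    if os.path.dirname(program) != plugin_dir:
        return None
    if not os.path.basename(program).startswith("check_"):
        return None
    return [program] + argv[1:]


def main():
    argv = parse_allowed(os.environ.get("SSH_ORIGINAL_COMMAND", ""))
    if argv is None:
        print("UNKNOWN: command not permitted")
        sys.exit(3)
    # Replace the wrapper with the plugin so its exit status reaches check_by_ssh.
    os.execv(argv[0], argv)


# Only act when invoked through sshd with a requested command.
if __name__ == "__main__" and "SSH_ORIGINAL_COMMAND" in os.environ:
    main()
```

The wrapper would be tied to the key by prefixing the key's line in ~/.ssh/authorized_keys with a command="/path/to/wrapper" option (the wrapper path here is a placeholder). Because arguments are passed as a list and no shell is involved, shell metacharacters in the request are not interpreted.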

On the central Nagios server, in the commands.cfg configuration file, define the new checks. The example below defines a new check_ssh_load command:

# 'check_ssh_load' command definition
define command {
        command_name    check_ssh_load
        command_line    $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "/home/user/bin/check_load -w $ARG1$ -c $ARG2$"
}

This command will call the check_by_ssh plugin to connect to the specified host (via the $HOSTADDRESS$ macro) and execute the command /home/user/bin/check_load, which is the check_load plugin, on the remote machine; you will need to adjust the path to match the location of that plugin on the remote server. As well, if paths and/or usernames differ on remote servers and you plan to monitor more than one, you may need to define multiple commands, one for each server (or use macros).

Next, edit services.cfg and add the following:

define service {
        use                             local-service           ; check current load on machine
        hostgroup_name                  ssh-nagios-services
        service_description             Current Load
        check_command                   check_ssh_load!5.0,4.0,3.0!10.0,6.0,4.0
}

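This defines a new service to execute for hosts in the ssh-nagios-services hostgroup. It calls the defined check_ssh_load command with two arguments, each a triple of thresholds for the 1-, 5-, and 15-minute load averages. The service goes into a warn state if the load averages reach 5.0, 4.0, or 3.0 respectively, and into a critical state if they reach 10.0, 6.0, or 4.0 (adjust to suit, of course).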
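
To make the link between check_command and command_line concrete, here is a simplified Python sketch of the substitution: the text after each ! becomes $ARG1$, $ARG2$, and so on. It is an illustration only, not how Nagios itself implements macro processing, and the $USER1$ value and host address used in the test are assumed.

```python
def expand_check_command(check_command, command_line, macros):
    """Substitute $ARGn$ and the given macros into a command_line template.

    Returns (command_name, expanded_command_line).
    """
    parts = check_command.split("!")
    values = dict(macros)
    # Everything after the command name becomes ARG1, ARG2, ...
    for index, arg in enumerate(parts[1:], start=1):
        values["ARG%d" % index] = arg
    expanded = command_line
    for name, value in values.items():
        expanded = expanded.replace("$%s$" % name, value)
    return parts[0], expanded
```

With the service above, $ARG1$ becomes 5.0,4.0,3.0 and $ARG2$ becomes 10.0,6.0,4.0, so the remote check_load receives them as its -w and -c values.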

Finally, edit hostgroups.cfg to create the ssh-nagios-services hostgroup. Systems added to this hostgroup will automatically begin to use the defined service.

define hostgroup {
        hostgroup_name  ssh-nagios-services
        alias           Nagios over SSH
        members         remote1,remote2
}

Here we define that remote1 and remote2 both belong to this hostgroup. As a result, both will start using the check_ssh_load command.

Using check_by_ssh is a convenient and secure way to execute Nagios plugins on remote servers. When all you can see of the status of a remote server is HTTP or SMTP availability, your view of the server is quite restricted. Being able to see local resource usage can allow you to spot problems, and correct them, before they are visible to users.


Automating Nagios service checks via SSH — Rudd-O.com

Expect to the rescue! Expect is a nifty program that automates tasks based on input and output. You should have installed Expect on your computer by now.
#!/usr/bin/expect -f

#Expect script to supply root/admin password for remote ssh server
#and execute command.
#This script needs at least three arguments to connect to the remote server:
#ipaddr = IP address of remote UNIX server, no hostname
#password = Password of remote UNIX server, for root user.
#scriptname = Path to remote script which will execute on remote server
#For example:
#./sshlogin.exp 192.168.1.11 password who
#------------------------------------------------------------------------
#Copyright (c) 2004 nixCraft project <http://cyberciti.biz/fb/>
#This script is licensed under GNU GPL version 2.0 or above
#-------------------------------------------------------------------------
#This script is part of nixCraft shell script collection (NSSC)
#Visit http://bash.cyberciti.biz/ for more information.
#----------------------------------------------------------------------
#set Variables

# lindex returns the plain argument string (lrange would return a one-element list)
set ipaddr [lindex $argv 0]
set password [lindex $argv 1]
set scriptname [lindex $argv 2]
set arg1 [lindex $argv 3]
set arg2 [lindex $argv 4]
set arg3 [lindex $argv 5]
set arg4 [lindex $argv 6]
set arg5 [lindex $argv 7]
set arg6 [lindex $argv 8]
set arg7 [lindex $argv 9]

#setting a timeout for the password prompt 5 seconds larger than the SSH ConnectTimeout parameter

set timeout 35

#now connect to remote UNIX box (ipaddr) with given script to execute

set pid [spawn -noecho ssh -o "ConnectTimeout 30" -o "CheckHostIP no" -o "StrictHostKeyChecking no" $ipaddr $scriptname $arg1 $arg2 $arg3 $arg4 $arg5 $arg6 $arg7]
match_max 5000

#look for password prompt

log_user 0
   expect {
      "denied"                       {puts "CRITICAL: wrong SSH password" ; exit 2}
      "Name or service not known"    {puts "CRITICAL: cannot resolve SSH server name $ipaddr" ; exit 2}
      "Connection refused"           {puts "CRITICAL: SSH connection to $ipaddr refused" ; exit 2}
      "Connection timed out"         {puts "CRITICAL: SSH connection to $ipaddr timed out" ; exit 2}
      timeout                        {puts "CRITICAL: SSH server timed out while prompting for password" ; exit 2}
      "?assword:"
   }

# send password, followed by a carriage return

send -- "$password\r"

# send blank line to make sure we get back to gui

send -- "\r"
   expect "\r"
   log_user 1

# now we wait up to 30 seconds

set timeout 30
   expect {
       timeout                   {puts "CRITICAL: execution of $scriptname timed out after 30 seconds" ; exit 2}
       eof
   }

set waitret [wait]
   catch {close}

set state [lindex $waitret 2]
   exit [lindex $waitret 3]

What this script does is fairly easy to understand (once it's been explained to you!). It starts ssh with the passed arguments (a maximum of 8), against the server you specify and with a password you specify as well. It returns the status value of the remoted (remotely invoked) command.

The script suppresses any SSH output not related to the command, so beware: if the password is wrong, you will not be told. The script also makes SSH skip host key verification, so if you're finicky about security, perhaps this is the wrong approach for you. But it works for me, so let's go on.
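
The exit statuses used by the script follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The Python sketch below shows the same pattern in miniature: run a command, pass its status through, and turn a timeout into CRITICAL. It does not handle passwords and is not a replacement for the Expect script; the function name is hypothetical.

```python
import subprocess

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(argv, timeout=30):
    """Run argv and return (exit_code, output) in plugin terms."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return CRITICAL, "CRITICAL: execution of %s timed out after %d seconds" % (argv[0], timeout)
    except OSError as error:
        return UNKNOWN, "UNKNOWN: cannot start %s: %s" % (argv[0], error)
    # The command's own exit status becomes the check result.
    return result.returncode, result.stdout.strip()
```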

HowtoSoftwareNagiosCheck By SSH - CNSWiki

 

Configuring Nagios with m4 « UNIX Administratosphere

19 February 2009

When using m4 to configure Nagios, great advantages can be realized.  One of the easiest places to gain an advantage by using m4 is when defining a new host.

Typically, a new host not only has a host definition but also a number of fairly standardized services, such as ping, FTP, telnet, SSH, and so forth. Thus, when defining a new host configuration, you have to add not only the host but all of the relevant services as well, and possibly host extra info and service extra info too.

#----------------------------------------
# HOST: marco
#----------------------------------------
define host{
        use                     hpux-host               ; Name of host template
        host_name               marco
        address                 192.168.4.1
        }
define hostextinfo{
        host_name               marco
        action_url              http://marco-mp/
}
define service{
        use                             passive-service          ; Name of service
        host_name                       marco
        service_description             System Load
        servicegroups                   Load
        }
define service{
        use                             hpux-service          ; Name of service
        host_name                       marco
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
        }
define service{
        use                             hpux-service          ; Name of service
        host_name                       marco
        service_description             TELNET
        servicegroups                   TELNET
        check_command                   check_telnet
        }
define serviceextinfo{
        host_name                       marco
        service_description             TELNET
        action_url                      telnet://marco
}
define service{
        use                             hpux-service          ; Name of service
        host_name                       marco
        service_description             FTP
        servicegroups                   FTP
        check_command                   check_ftp
        }
define service{
        use                             hpux-service          ; Name of service
        host_name                       marco
        service_description             NTP
        servicegroups                   NTP
        check_command                   check_ntp
        }
define service{
        use                             hpux-service          ; Name of service
        host_name                       marco
        service_description             SSH
        servicegroups                   SSH
        check_command                   check_ssh
        }

Compare that output with the m4 code that generated it:

DEFHPUX(`marco',`192.168.4.1')

Another benefit is that if DEFHPUX is coded correctly (with each service independent, such as an m4 macro DOSSH for SSH), then a single change to the m4 file, propagated to the Nagios config file, can alter a service for every HP-UX host (in this example).

Here is a possible definition of DEFHPUX:

define(`DEFHPUX',`
#----------------------------------------
# HOST: $1
#----------------------------------------
define host{
        use                     hpux-host               ; Name of host template
        host_name               $1
        address                 $2
        }
define hostextinfo{
        host_name               $1
        action_url              http://$1-mp/
}'
DOLOAD(`$1')
DOPING(`$1')
DOTELNET(`$1')
DOFTP(`$1')
DONTP(`$1')
DOSSH(`$1'))

There is a lot more that m4 can do; this is just the tip of the iceberg.
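
For readers who do not use m4, the same idea of generating a host and its standard services from one call can be sketched in Python. This is only an analogy to the DEFHPUX macro, not a translation of it; the template strings, the service list, and the function name are made up for the example.

```python
HOST_TEMPLATE = """define host{{
        use                     {template}-host
        host_name               {name}
        address                 {address}
        }}
"""

SERVICE_TEMPLATE = """define service{{
        use                             {template}-service
        host_name                       {name}
        service_description             {description}
        check_command                   {check_command}
        }}
"""

# Services every host of this class should get: (description, check_command).
HPUX_SERVICES = [
    ("PING", "check_ping!100.0,20%!500.0,60%"),
    ("SSH", "check_ssh"),
]


def define_hpux_host(name, address):
    """Return the host definition plus one service definition per standard service."""
    blocks = [HOST_TEMPLATE.format(template="hpux", name=name, address=address)]
    for description, check_command in HPUX_SERVICES:
        blocks.append(
            SERVICE_TEMPLATE.format(
                template="hpux",
                name=name,
                description=description,
                check_command=check_command,
            )
        )
    return "".join(blocks)
```

As with the m4 version, changing one entry in the service list changes the generated configuration for every host of that class.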


NRPE plug-in

Nagios Howto: Using NRPE To Monitor Remote Services

This whitepaper is a continuation of the previous article, Nagios Howto: Notification Escalations, EventHandlers & Remote Service Monitoring With NRPE.

As previously mentioned, our focus assumes the use of Linux and a working Nagios installation. I highly suggest you go back to read the previous Nagios howto as it contains important information that we will be building upon as we move into the second part of this whitepaper.

Thank you for rejoining if you have already read the first Crucial Nagios whitepaper.

As you have likely seen, the Nagios docs leave a bit to be desired when it comes to information on the NRPE plugin. In its simplest form, the NRPE plugin allows you to monitor any number of remote network devices and services using a single Nagios installation. However, when we combine EventHandlers with NRPE we then have the ability to repair our remote servers, giving us self-healing servers. For now, we will focus our attention on NRPE and walk through the steps to properly configure your NRPE daemon.

Download NRPE Plugin

The NRPE source code and the default plugin set are available from the Nagios website. You will need to download the NRPE plugin and any other plugins to the remote machine that you intend to monitor:

cd /usr/src
wget http://umn.dl.sourceforge.net/sourceforge/nagiosplug/nagios-plugins-1.4.6.tar.gz
tar zxvf nagios-plugins-1.4.6.tar.gz
cd nagios-plugins-1.4.6

The instructions above will download and extract the Nagios plugins, as well as change into that directory.

Build The Source Code

We now need to build the source code. This step needs to be done on each remote system that you plan to monitor. Follow the steps below to build the default plugin set:

./configure --prefix=/usr/local/nagios
make
make install

We now have /usr/local/nagios/libexec/ which contains the default plugin set.

At this time we need to download and install the NRPE daemon and plugin. The steps below detail the commands needed for execution:

cd /usr/src/
wget http://internap.dl.sourceforge.net/sourceforge/nagios/nrpe-2.7.tar.gz
tar zxvf nrpe-2.7.tar.gz
cd nrpe-2.7
./configure
make all

Move Things Around

Now we need to manually move the files into place:

cp src/nrpe /usr/local/nagios/libexec/
cp src/check_nrpe /usr/local/nagios/libexec/
cp sample-config/nrpe.cfg /usr/local/nagios/libexec/

We now have our executables in place and are ready to begin configuring the NRPE daemon on the remote system.

Configuration

The sample configuration file we copied above is a well documented file. You should take the time to read this file and familiarize yourself with the configuration options that we will be setting below. Open the nrpe.cfg file in the Nagios libexec directory in your favorite editor.

We are going to leave some default settings and change a few settings for our needs. Set the following configuration options as follows:

pid_file=/var/run/nrpe.pid
server_port=5666

# Set this if you want to nail NRPE to specific IP address
# server_address=192.168.1.1
nrpe_user=nagios
nrpe_group=nagios

# Set this to the remote Nagios installation IP address
allowed_hosts=127.0.0.1
dont_blame_nrpe=0
# command_prefix=/usr/bin/sudo
# Set this to 1 for logging in syslog
debug=0
command_timeout=60
connection_timeout=300
# allow_weak_random_seed=1

That's it for the configuration of NRPE.

Commands

We now need to look at the commands available to NRPE. If you scroll to the bottom of the nrpe.cfg file you will see the default commands. The commands are structured like so:

command[check_disk1]=/usr/local/nagios/libexec/check_disk -w 20 -c 10 -p /dev/hda1

Command names are completely arbitrary and can be created on the fly, e.g.:

command[check_disk2]=/usr/local/nagios/libexec/check_disk -w 20 -c 10 -p /dev/hdb1

The format is simple: check_disk1 is the command name, and it runs /usr/local/nagios/libexec/check_disk with the arguments -w 20 -c 10 -p /dev/hda1. I used this particular command because it contains the disk check; this is the one command that you may need to alter immediately for effective use. At the end of the command we see the path of the disk device to check, /dev/hda1. You may not have this drive configuration, so you will need to replace it with the path to your local disk. An easy way to figure this out is to run df -h and use the entry for /home, as this is the primary usage space for most.
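
When a host accumulates many of these definitions, it can help to list them. The Python sketch below parses command[name]=command line entries from the text of an nrpe.cfg file. It assumes one definition per line and comment lines starting with #, and the function name is hypothetical.

```python
import re

# Matches lines such as: command[check_disk1]=/path/to/check_disk -w 20 -c 10 -p /dev/hda1
COMMAND_RE = re.compile(r"^\s*command\[([^\]]+)\]\s*=\s*(.+?)\s*$")


def parse_nrpe_commands(text):
    """Return a dict mapping NRPE command names to their command lines."""
    commands = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip commented-out definitions
        match = COMMAND_RE.match(line)
        if match:
            commands[match.group(1)] = match.group(2)
    return commands
```

The names returned are exactly the ones that can appear after check_nrpe! in a service definition on the Nagios server.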

System Setup

At this point, we have completed configuring NRPE and we need to setup the system to accommodate Nagios.

First we need to setup permissions for the Nagios user.

adduser nagios
chown -R nagios.nagios /usr/local/nagios/

We've set up our Nagios user and changed the ownership of all the files under the nagios/ dir.

Now we need to edit the file /etc/services and add the following line:

nrpe 5666/tcp # NRPE

Now, we need to tell our inetd or xinetd about NRPE. Create a file in /etc/xinetd.d/ called nrpe, and add the following to that file:

# default: on
# description: NRPE
service nrpe {
    flags = REUSE
    socket_type = stream
    wait = no
    user = nagios
    server = /usr/local/nagios/libexec/nrpe
    server_args = -c /usr/local/nagios/libexec/nrpe.cfg --inetd
    log_on_failure += USERID
    disable = no

    # Change this to your primary Nagios server
    only_from = 127.0.0.1
}

This describes to the "super server" the various options necessary to launch the NRPE daemon when our remote Nagios monitoring system connects.

Now, open the /etc/hosts.allow file and add an entry for the IP address of your remote monitoring server. If you have a firewall, you will also want to configure it so that you allow remote connections from the IP address of your remote monitoring system to port 5666.

Restart your xinetd daemon to reload the configuration changes:

/etc/init.d/xinetd reload

Let’s test it out real quick to make sure nothing has gone wrong so far. From your remote monitoring server issue the following command:

telnet ip.address.of.remote.nrpe 5666

If the connection immediately closes you’ve got a problem and something isn’t right. If the socket opens and you are met with the following:

Escape character is '^]'.

then you're ready to move on. If you've got problems at this point, go back through each of the steps above and check for any errors in configuration. If you set debug=1 in nrpe.cfg, you can also check your syslog file for failure information.
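
The telnet test can also be scripted. The Python sketch below connects to the port and then waits briefly: a connection that is closed at once is reported as "closed", one that stays open as "open". The state names, the default wait, and the function name are choices made for this example; it does not speak the NRPE protocol.

```python
import socket


def probe_port(host, port=5666, wait=2.0):
    """Return "unreachable", "closed" (peer hung up at once) or "open"."""
    try:
        connection = socket.create_connection((host, port), timeout=wait)
    except OSError:
        return "unreachable"
    try:
        connection.settimeout(wait)
        data = connection.recv(1)
    except socket.timeout:
        # Nothing arrived and nobody hung up: the socket stayed open.
        return "open"
    except OSError:
        return "closed"
    finally:
        connection.close()
    # An empty read means the other side closed the connection.
    return "closed" if data == b"" else "open"
```

A result of "closed" corresponds to the case described above where the connection immediately closes, which usually points at allowed_hosts, only_from, or hosts.allow.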

Add New Host

We are now ready to add our new host to our primary Nagios installation. This is very straightforward and should only take a moment.

Back on the primary Nagios installation server we need to edit our hosts.cfg configuration file. The file is located in /usr/local/nagios/etc/hosts.cfg. This may change depending on your installation and organization of configuration files. Read the first part of this whitepaper for organization advice.

In the hosts.cfg file, add your new host object:

define host{
    use generic-host

    # Hostname of remote system
    host_name host.domain.com
    # A friendly name for this server
    alias Friendly name
    # Remote host IP address
    address 127.0.0.1
    check_command check-host-alive
    max_check_attempts 10
    notification_interval 30
    notification_period 24x7
    notification_options d,r
    # Your defined contact group name
    contact_groups admins
}

At this time our hosts.cfg file contains two host objects: the localhost, which is running the Nagios application, and our remote host, which we will be monitoring.

We now want to add the service objects to our services.cfg file located in the same directory. Add the following single service to your services.cfg file:

define service{
    use generic-service
    # Hostname of remote system
    host_name host.domain.com
    service_description Primary Disk Usage
    is_volatile 0
    check_period 24x7
    max_check_attempts 3
    normal_check_interval 5
    retry_check_interval 1
    # Change to your contact group
    contact_groups admins
    notification_options w,u,c,r
    notification_interval 10
    notification_period 24x7
    check_command check_nrpe!check_disk1
}

You can view the Nagios documentation for the full details on each of these object configuration options. You will likely want to adjust the values shown above to suit your monitoring environment. However, we will take a look at that last line, the check_nrpe option.

check_nrpe

When monitoring remote services, the check_command consists of check_nrpe followed by a ! and the name of the command to run on the remote machine. This means that we need an instance of check_nrpe on our Nagios server. Simply follow the directions above to download, build, and install NRPE there; the part the Nagios server actually uses is the check_nrpe plugin, while the nrpe daemon is what answers on the monitored hosts. Once check_nrpe is installed on the primary Nagios server, we can proceed.
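
The article does not show the check_nrpe command definition that Nagios needs in its command configuration. A common form passes the host address with -H and the remote command name with -c. The Python sketch below builds that argument list from a check_command string. The plugin path, the default timeout, and the function name are assumptions.

```python
# Assumed install location of the check_nrpe plugin on the Nagios server.
CHECK_NRPE = "/usr/local/nagios/libexec/check_nrpe"


def check_nrpe_argv(check_command, host_address, port=5666, timeout=10):
    """Turn "check_nrpe!<remote command>" into the argv for the check_nrpe plugin."""
    name, _, remote_command = check_command.partition("!")
    if name != "check_nrpe" or not remote_command:
        raise ValueError("expected check_nrpe!<remote command>, got %r" % check_command)
    return [
        CHECK_NRPE,
        "-H", host_address,      # host running the NRPE daemon
        "-p", str(port),         # port the daemon listens on
        "-t", str(timeout),      # seconds before the plugin gives up
        "-c", remote_command,    # name defined as command[...] in nrpe.cfg
    ]
```

Running the resulting command by hand from the Nagios server is a quick way to test a new nrpe.cfg entry before adding the service.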

Now that nrpe is installed on the primary Nagios server, and our new host and host service is configured, we can reload nagios service:

/etc/init.d/nagios reload

Web Interface

With the configuration read, you should now be able to access the web interface of Nagios. Under the Service Detail link you should see both the new remote host and the server/service we have setup to monitor. It is likely in an Unknown State at this time as the service has not been checked yet.

According to our service definition above, this service will be checked once every five minutes. If all has gone well, we should see the green in less than 5 minutes, which confirms proper installation and configuration of NRPE. In failure the service will go into a Soft State for two additional minutes. Once a Hard Failure state is achieved, you will see red and you should be able to check your Nagios log file in nagios/var/nagios.log for further information.
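
The two minutes mentioned above follow from the service definition: with max_check_attempts 3 there are two rechecks after the first failure, each retry_check_interval (1 minute) apart. A small Python sketch of the arithmetic, assuming the default Nagios interval_length of 60 seconds per time unit and ignoring scheduling latency:

```python
def minutes_until_hard_state(max_check_attempts, retry_check_interval, interval_length=60):
    """Minutes between the first failed check and the check that makes the state hard."""
    retries = max(max_check_attempts - 1, 0)
    return retries * retry_check_interval * interval_length / 60.0
```

For the service above, minutes_until_hard_state(3, 1) gives 2.0.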

There are a lot of moving parts with this project so it is best to focus on a single server and a single service. Once you have a service properly configured it is a short step to configure the next service. Simply copy the service object created above and change the command name that follows check_nrpe! in its check_command.

What You Can Do

Many things can be monitored with NRPE that can not be monitored remotely by Nagios without NRPE. These include:

And anything else that doesn't run as a public service on the server.

Obviously, the advantage to remotely monitoring these server objects in a central location is that a problem may be much more quickly identified. This combined with the previous whitepaper’s escalation procedures provide an effective response tool for reactively monitoring remote servers.

Remember, any of the commands that you have in the nagios/libexec folder are available to NRPE. To run these commands on the remote server, you simply need to setup the command in the nrpe.cfg file on the remote server. Here is an example using check_load:

command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20

The -w and -c options set the warning and critical thresholds; each takes three values, for the 1-, 5-, and 15-minute load averages.

Follow the steps above to add the new service, check_load, to your services.cfg file. Reload Nagios and that's that.

EventHandler

In the next whitepaper, we will change our focus to the Nagios EventHandler. I will demonstrate to you how to repair problems that Nagios encounters before even contacting a single human. At Crucial Web Hosting, we make extensive use of the EventHandler object in Nagios and we credit it for a very happy support team. Using the EventHandler objects we can diagnose and repair common problems that occur on local and remote servers in a matter of minutes and seconds as opposed to hours and days.

We will be performing root tasks using the sudo method and we will create a simple custom EventHandler on a remote server thus demonstrating how you can roll your own Nagios plugins.

In the next whitepaper you will learn how to make your servers heal themselves with no human interaction!

If you missed the first whitepaper in the series I am writing, you can access it here. I look forward to your questions and comments.

Old News ;-)

[Aug 25, 2009] Zenoss Vs Nagios - TechSoup

Just wondering if anyone has any opinion about Zenoss vs Nagios. I have helped people set up one or the other, but I have not used either of them over time.

Zenoss looks like it has a nicer user interface, but it was less intuitive for how it was set up. (I used the NagioSql tool for administering much of Nagios, and that made the setup much nicer)

[Aug 25, 2009] check_mk

freshmeat.net

check_mk is a general purpose Nagios-plugin for retrieving data. It adopts a new approach for collecting data from operating systems and network components. It obsoletes NRPE, check_by_ssh, NSClient, and check_snmp and it has many benefits, the most important of which are significant reduction of CPU usage on the Nagios host and automatic inventory of items to be checked on hosts. The larger your Nagios installation is, the more helpful these improvements.

[Aug 1, 2009] Deploying Nagios in a Large Enterprise Environment, at USENIX LISA '07

Interesting presentation about splitting Nagios into multiple domains (I think that every 100 servers require a separate instance if checks are extensive) and using passive checks to avoid the bottleneck of "agentless probes". Configuration file generation can be a big help when servers are similar. A large deployment requires configuration management of the Nagios config files. It's interesting how they "reinvented the bicycle" for some concepts like querying of alerts, etc., which should be in an enterprise monitoring system from the very beginning :-)

[Aug 1, 2009] Notifications and Events in Nagios 3.0- part2

This is the second part of the two-part series by Wojciech Kocjan in which we have made an effort to cover everything in notifications and events in Nagios 3.0. The first part covered:

In this article, we will cover the following sub-topics:
  • External Commands
  • Event Handlers
  • Modifying Notifications
  • Adaptive Monitoring

External Commands

Nagios offers a very powerful mechanism for receiving events and commands from external applications: the external commands pipe. This is a pipe file created on a file system that Nagios uses to receive incoming messages. The name of the file is rw/nagios.cmd and it is located in the directory passed as the localstatedir option during compilation. Following the compilation and installation instructions and the given guidelines, the file name will be /var/nagios/rw/nagios.cmd.

The communication does not use any authentication or authorization—the only requirement is to have write access to the pipe file. An external command file is usually writable by the owner and the group; the usual group used is nagioscmd. If you want a user to be able to send commands to the Nagios daemon, simply add that user to this group.

A small limitation of the command pipe is that there is no way to get any results back and so it is not possible to send any query commands to Nagios. Therefore, by just using the command pipe, you have no verification that the command you have just passed to Nagios has actually been processed, or will be processed soon. It is, however, possible to read the Nagios log file and check if it indicates that the command has been parsed correctly, if necessary.
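
Checking the log can be automated. When external command logging is enabled (the log_external_commands option), Nagios writes lines of the form [timestamp] EXTERNAL COMMAND: NAME;arg;arg to its log. The Python sketch below scans log lines for a given command; the function names are hypothetical, and the log file location is whatever your installation uses.

```python
import re

# Example line: [1206096000] EXTERNAL COMMAND: CHANGE_CUSTOM_CONTACT_VAR;jdoe;DESKTOPIP;12.34.56.78
LOG_RE = re.compile(r"^\[(\d+)\] EXTERNAL COMMAND: ([A-Z0-9_]+)(?:;(.*))?$")


def external_commands(log_lines):
    """Yield (timestamp, command_name, argument_list) for each logged external command."""
    for line in log_lines:
        match = LOG_RE.match(line.rstrip("\n"))
        if match:
            arguments = match.group(3).split(";") if match.group(3) is not None else []
            yield int(match.group(1)), match.group(2), arguments


def was_logged(log_lines, command, arguments):
    """Return True if the command with exactly these arguments appears in the log."""
    return any(
        name == command and fields == list(arguments)
        for _, name, fields in external_commands(log_lines)
    )
```

A log entry shows that Nagios read the command from the pipe; it does not by itself prove that the command had the intended effect.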

An external command pipe is used by the web interface to control how Nagios works. The web interface does not use any other means to send commands or apply changes to Nagios. This gives a good understanding of what can be done with the external command pipe interface.

From the Nagios daemon perspective, there is no clear distinction as to who can perform what operations. Therefore, if you plan to use the external command pipe to allow users to submit commands remotely, you need to make sure that the authorization is in place as well so that it is not possible for unauthorized users to send potentially dangerous commands to Nagios.

The syntax for formatting commands is easy. Each command must be placed on a single line and end with a newline character.

Troubleshooting Nagios 3.0
Tuesday, April 7, 2009 | Linux Servers
 

In this article by Wojciech Kocjan, we will learn about troubleshooting Nagios 3.0, which includes troubleshooting the web interface, passive checks, SSH-based checks, and NRPE. The article includes various possible errors along with their solutions and detailed explanations for each error listed.


Notifications and Events in Nagios 3.0- part2 Tuesday, May 12, 2009 | Linux Servers


Learning Nagios 3.0 Table of Contents Monday, October 27, 2008 | All

Notifications and Events in Nagios 3.0-part1 Monday, May 11, 2009 | Linux Servers
 

This is a 2-part series by Wojciech Kocjan. We have made an attempt to cover all about events and notifications in Nagios 3.0 in detail in this series. The following sub-topics will be covered as a part of this series:


Passive Checks and NSCA (Nagios Service Check Acceptor) Wednesday, November 19, 2008 | Networking & Telephony
Nagios is a very powerful platform because it is easy to extend. A great feature that Nagios offers is the ability for third-party software or other Nagios instances to report information on the status of services or hosts. This way, Nagios does not need to schedule and run checks by itself, but other applications can report information as it is available to them. This means that your applications can send problem reports directly to Nagios, instead of just logging them. In this way, your applications can benefit from powerful notification systems as well as dependency tracking. In this article by Wojciech Kocjan, we will see how this mechanism can also be used to receive failure notifications from other services or machines—for example, SNMP traps.

[May 6, 2009] Custom checks and notifications for Nagios  by Mike Diehl

September 11th, 2008 | Linux Journal

A while back, I wrote an article for Linux Journal's web edition entitled “Howto be a good (and lazy) System Administrator.” A couple astute readers, after reading the article, asked if I was familiar with the Nagios monitoring system, and I am. I've been using Nagios for a few years now.

I had intended to write this article as a How-to on getting Nagios configured and running for the first time. However, it turns out that the documentation that comes with Nagios is really pretty good. And even if you do have problems, and I did, the user community is also quite responsive. So, rather than beating a dead horse, (with sympathy to horse lovers) I decided to continue the Good and Lazy Administrator Theme and discuss extending Nagios with custom service checks and custom notifications.

Nagios uses a plug-in mechanism to implement all of its server and service checks as well as all of its notifications. This is good news for hackers, as it allows us to build new functionality that either no one else has thought of, or has need of. I wrote a couple of scripts for my Nagios system. One does a custom service check to see if I have voicemail waiting for me at the Help Desk, and the other does a custom notification by telephone. Before I go on, I should give a little bit of background.

I maintain several servers, both for myself and for customers. These servers range from web servers to phone systems running Asterisk. Just like most System Administrators, I don't have any “optional” servers or services; the stuff just has to work, and when it isn't, I need to know. But I'll tell you, I'm not interested in sitting at the desk watching the Help Desk phone or the monitoring screen. I'm either too lazy, or too busy. Either way, that's what silicon is for, right?

My phone system at the house runs on Asterisk. You can read more about my home infrastructure at http://www.linuxjournal.com/article/9111. My Nagios server runs on the same server, so it just makes sense to integrate the two services.

I've created a Nagios script that monitors the Help Desk voicemailbox and raises a critical service alert in Nagios if any messages are waiting. I've also written a script that can call me, perhaps on my cellphone, in the event of a service outage. With these two scripts in place, I can get a call on my cellphone any time someone calls my Help Desk and leaves a message. I can also get a call if any of my monitored services fail. Theoretically, I can be at a park playing with my boys and know that my servers are happy... until the cellphone rings.

I understand that I have kind of a unique situation, but the same concept is applicable in a business production environment, so let's get down to looking at code.

First, let's talk about the Help Desk monitoring script. Essentially, this script checks to see if there are any files in the INBOX in the Help Desk mailbox. Here is the code:

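#!/usr/bin/perl -w
use strict;

# Directory that holds the Help Desk voicemail messages.
my $inbox = "/var/spool/asterisk/voicemail/customers/611/INBOX/";
my $error = 0;

opendir(my $dir, $inbox)
       or die("Error: cannot open $inbox: $!\n");

# Count every entry except the "." and ".." pseudo-directories.
while (my $file = readdir($dir)) {
       next if $file eq "." or $file eq "..";
       $error++;
}
closedir($dir);

# Convert the file count into a message count; the divisor 4 presumably
# reflects the number of files Asterisk stores per message on this system.
$error = $error/4;

# Nagios reads the first line of output and the exit status:
# 0 means OK, 2 means CRITICAL.
if (!$error) {
       print "OK\n";
       exit 0;
}

print "CRITICAL: $error\n";
exit 2;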

Of course, you need to make sure that the Nagios user has access to the Asterisk voicemailbox, but that can be taken care of by setting the script set-uid. The script, as you can see, is pretty simple. If there are any other files in the directory, the script assumes that there is a voicemail and sends a CRITICAL alert to Nagios. Otherwise everything is OK.
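The contract between Nagios and a check script is small: one line of output plus an exit status (0 for OK, 2 for CRITICAL, 3 for UNKNOWN). The following is a minimal shell sketch of the same idea, not the author's script; the function name check_inbox and its output wording are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of a Nagios-style check that alerts when a directory holds files.
# check_inbox is a hypothetical name; pass the voicemail folder as $1.
check_inbox() {
    dir=$1

    # Report UNKNOWN (status 3) if the directory cannot be read at all.
    if [ ! -d "$dir" ] || [ ! -r "$dir" ]; then
        echo "UNKNOWN: cannot read $dir"
        return 3
    fi

    # Count regular files; an empty directory leaves the glob unmatched.
    count=0
    for f in "$dir"/*; do
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done

    if [ "$count" -eq 0 ]; then
        echo "OK"
        return 0
    fi

    echo "CRITICAL: $count file(s) waiting"
    return 2
}
```

A wrapper script would call the function with the INBOX path and finish with `exit $?` so that Nagios sees the status.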

To enable Nagios to use this check script, we need to define it in checkcommands.cfg. Here is the definition I used:

define command{
       command_name    check_help
       command_line    /etc/nagios/local/check_611.pl
}

Now, I can refer to the check_help check script in the services.cfg file. Here's how I did it:

define service {
       use generic-service
       name                    Help_Desk
       host_name               my_server
       service_description     Help Desk Voicemail
       check_command           check_help
       register 1
}

With this configuration in place, Nagios can indicate an alarm any time there is voicemail in the Help Desk mailbox. But that's only half of what I promised to write about. The next script allows Nagios to call me to let me know that I've got a fire to put out. Here is that script:

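#!/usr/bin/perl

# Queue one outage call per phone number by dropping a call file
# into Asterisk's outgoing spool directory.
foreach $main::phone ("15055551234") {

       # NOTE: the line that selects the outgoing channel for $main::phone
       # appears to be missing from this copy of the script.
       $main::call = <<EOF
MaxRetries: 0
RetryTime: 1
WaitTime: 120
Account: Enterprise
Context: apps
Extension: OUTAGE
Priority: 1
EOF
;

       # Write the call file outside the spool, then move it into place.
       open FILE, ">/tmp/outage.call" or die("Error: cannot write call file\n");
       print FILE $main::call;
       close FILE;

       system("mv /tmp/outage.call /var/spool/asterisk/outgoing");
}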

As you can see, this script isn't complicated, either. It simply creates an Asterisk “call file” and puts it in Asterisk's outgoing spool directory. The script is capable of calling multiple numbers... just in case. It's important that the call file be created in another directory and then moved into the spool directory. Otherwise bad things can happen if Asterisk tries to read the file while the script is still writing it.
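The write-then-move step can be sketched in shell as follows. This is an illustration under stated assumptions rather than the author's code: the function name queue_call_file and both directory arguments are hypothetical, and the staging directory is assumed to be on the same filesystem as the spool so that mv is a rename and not a copy.

```shell
#!/bin/sh
# queue_call_file STAGING_DIR SPOOL_DIR   (hypothetical helper)
# Reads the call-file text on stdin, writes it to a temporary file in
# STAGING_DIR, then moves the finished file into SPOOL_DIR in one step.
queue_call_file() {
    staging=$1
    spool=$2

    # Create a uniquely named temporary file outside the spool.
    tmp=$(mktemp "$staging/outage.XXXXXX") || return 1

    # Write the complete contents before anything can see the file.
    if ! cat >"$tmp"; then
        rm -f "$tmp"
        return 1
    fi

    # Within one filesystem mv is a rename, so a reader of the spool
    # never observes a partially written file.
    mv "$tmp" "$spool/$(basename "$tmp").call"
}
```

Using a unique temporary name also avoids two notifications overwriting each other's /tmp/outage.call when they fire at the same moment.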

Obviously this script relies on some configuration in the Asterisk dial plan. Here is the relevant part of the dial plan:

exten => OUTAGE,1,answer
exten => OUTAGE,2,playback(/etc/asterisk/sounds/OUTAGE)
exten => OUTAGE,3,hangup

At this point, you're probably realizing that I'm not doing anything complicated. All that is needed from Asterisk's point of view is an audio message in /etc/asterisk/sounds/OUTAGE (.wav or .au) that indicates that something is on fire. Asterisk will select the most reasonable file extension and play the file when the call is answered.

So all that is left to do is configure Nagios to use this notification method. This is configured in the misccommands.cfg file. Here is how I did it:

# 'notify_by_phone' command definition
define command{
       command_name    notify_by_phone
       command_line    /etc/nagios/local/notify_by_phone.pl
}

Now that all of the configuration is done, we restart Nagios and reload the Asterisk dial plan. To do this, we type “/etc/init.d/nagios restart” at the command line and “extensions reload” at the Asterisk console.

So now, anytime I have voicemail at the Help Desk, it's indicated in the Nagios monitoring screen as a critical alert. Also, anytime any of my servers or services are unavailable, I can get a phone call on either my home phone or my cell phone. This means that my customers don't HAVE to have those phone numbers and I can still provide quality service to them.

Now I realize that I have a unique situation, but I hope that this article serves as an example of how to create custom Nagios service checks and notifications, as well as hinting at some of the integration options available in Asterisk.
__________________________
 

Mike Diehl is a freelance Computer Nerd specializing in Linux administration, programming, and VoIP. Mike lives in Albuquerque, NM, with his wife and 3 sons. He can be reached at mdiehl@diehlnet.com

 

NagiosExchange Search Results

check_by_telnet

http://www.ighor.com/?item=3

 

check cpu, memory, hard-disk by telnet connection

 

Ganglia and Nagios, Part 2 Monitor enterprise clusters with Nagios

This is the second article in a two-part series that looks at a hands-on approach to monitoring a data center using the open source tools Ganglia and Nagios. In Part 2, learn how to install and configure Nagios, the popular open source computer system and network monitoring application software that watches hosts and services, alerting users when things go wrong. The article also shows you how to unite Nagios with Ganglia (from Part 1) and add two other features to Nagios for standard clusters, grids, and clouds to help with monitoring network switches and the resource manager.

Recap of Part 1

Data centers are growing and administrative staffs are shrinking, necessitating efficient monitoring tools for compute resources. Part 1 of this series discussed the benefits of using Ganglia and Nagios together, then showed you how to install and extend Ganglia with homemade monitoring scripts.

Recall from Part 1 the multiple definitions of monitoring (depending on the implier and the inferrer):

You can find code to monitor exactly what you want to monitor and that code can be of the open source variety. The most difficult part of using open source monitoring tools comes when you attempt to implement an install and puzzle out a configuration that works well for your environment. Two major problems with open source (and commercial) monitoring tools are the following:

  1. No tool will monitor everything you want the way you want it.
  2. Much customization could be required to get the tool working in your data center exactly how you want it.

Ganglia is a tool that monitors data centers and is used heavily in high-performance computing environments (but it's attractive for other environments too like clouds, render farms, and hosting centers). It is more concerned with gathering metrics and tracking them over time compared with Nagios's focus as an alerting mechanism. Ganglia used to require an agent to run on every host to gather information from it, but now metrics can be obtained from just about anything through Ganglia's spoofing mechanism. Ganglia doesn't have a built-in notification system, but it was designed to support scalable built-in agents on target hosts.

After reading Part 1, you could install Ganglia, as well as answer the monitoring questions that different user groups tend to ask. You could also configure the basic Ganglia setup, use the Python modules to extend functionality with IPMI (the Intelligent Platform Management Interface), and use Ganglia host spoofing to monitor IPMI.

Now, let's look at Nagios.

Introducing Nagios

This part shows you how to install Nagios and tie Ganglia back into it. We're going to add two features to Nagios that'll help your monitoring efforts in standard clusters, grids, clouds (or whatever your favorite buzzword is for scale-out computing). The two features are all about:

In this case, we'll be monitoring TORQUE. When we are finished, you'll have a framework to control the monitoring system of your entire data center.

Nagios, like Ganglia, is used heavily in HPC and other environments, but Nagios is more of an alerting mechanism than Ganglia (which is more focused on gathering and tracking metrics). Nagios previously only polled information from its target hosts, but has recently developed plug-ins that allow it to run agents on those hosts. Nagios has a built-in notification system.

Now let's install Nagios and set up a baseline monitoring system of an HPC Linux® cluster to address the three different monitoring perspectives:

[Apr 21, 2009] check_openmanage freshmeat.net

check_openmanage is a plugin for Nagios that checks the hardware health of Dell PowerEdge and PowerVault servers. It uses the Dell OpenManage Server Administrator (OMSA) software to accomplish this task. check_openmanage can be used remotely with SNMP or locally with NRPE. The plugin checks the health of the storage subsystem, power supplies, memory modules, temperature probes, etc., and gives an alert if any of the components are faulty or operate outside normal parameters.

Changes: The --global option was added, which turns on checking of everything. If used with SNMP, the global system health status is also probed, to protect the user against bugs in the plugin. If used with omreport, the overall chassis health is used. Support for SNMP version 3 was added. Checking of esmhealth was added, which checks the overall health of the ESM log, i.e. the fill grade. Alert log reporting was fixed to use the same format as for the ESM log. Output messages are now sorted by severity. Minor changes were made in how out-of-date controller firmware/driver is reported.

[Dec 22, 2008] Nagiosgraph

Nagiosgraph is an add-on for Nagios. It collects service perfdata in RRD format, and displays the resulting graphs via CGI.

Linux.com Using external commands in Nagios

By Wojciech Kocjan on November 20, 2008 (8:00:00 PM)
System monitoring tool Nagios offers a powerful mechanism for receiving events and commands from external applications. External commands are usually sent from event handlers or from the Nagios Web interface. You will find external commands most useful when writing event handlers for your system, or when writing an external application that interacts with Nagios.

This article is excerpted from the newly published book Learning Nagios 3.0 from Packt Publishing.

The external commands pipe is a pipe file created on a filesystem that Nagios uses to receive incoming messages. The communication does not use any authentication or authorization -- the only requirement is to have write access to the pipe file, rw/nagios.cmd, which is located in the directory passed as the localstatedir option during compilation.

An external command file is usually writable by the owner and the group; the usual group used is nagioscmd. If you want a user to be able to send commands to the Nagios daemon, simply add that user to this group.

A small limitation of the command pipe is that there is no way to get any results back, so it is not possible to send any query commands to Nagios. Therefore, by just using the command pipe, you have no verification that the command you have passed to Nagios has been processed, or will be processed soon. It is, however, possible to read the Nagios log file and check whether it indicates that the command has been parsed correctly.
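Checking the log for a submitted command can be sketched as below. The sketch assumes that logging of external commands is enabled and that each entry contains the text "EXTERNAL COMMAND:" followed by the command; the function name command_logged and the log path in the usage note are hypothetical.

```shell
#!/bin/sh
# command_logged LOGFILE COMMAND_TEXT   (hypothetical helper)
# Succeeds if LOGFILE contains an "EXTERNAL COMMAND:" entry that
# starts with COMMAND_TEXT, for example "ADD_HOST_COMMENT;somehost".
command_logged() {
    # -F treats the text literally, -q suppresses output.
    grep -F -q "EXTERNAL COMMAND: $2" "$1"
}
```

A caller might run `command_logged /var/nagios/nagios.log "ADD_HOST_COMMENT;somehost"` a few seconds after writing to the pipe. A match only shows that Nagios parsed the command, not that it had the intended effect.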

The Nagios Web interface uses an external command pipe to control how Nagios works. The Web interface does not use any other means to send commands or apply changes to Nagios.

From the Nagios daemon perspective, there is no clear distinction as to who can perform what operations. Therefore, if you plan to use the external command pipe to allow users to submit commands remotely, you need to make sure that authorization is in place so that unauthorized users cannot send potentially dangerous commands to Nagios.
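One way to put such authorization in place is a small wrapper that is the only program allowed to write to the pipe. The following sketch assumes an allowlist approach; the list contents and the names is_allowed and submit_command are hypothetical, and a real deployment would also need to authenticate the caller.

```shell
#!/bin/sh
# Hypothetical allowlist of commands that remote users may submit.
ALLOWED_COMMANDS="ADD_HOST_COMMENT ADD_SVC_COMMENT SCHEDULE_HOST_CHECK"

# is_allowed NAME: succeeds only if NAME is on the allowlist.
is_allowed() {
    for allowed in $ALLOWED_COMMANDS; do
        if [ "$allowed" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

# submit_command PIPE NAME [ARG...]: validate, then write one line.
submit_command() {
    pipe=$1
    name=$2
    shift 2

    if ! is_allowed "$name"; then
        echo "refused: $name is not allowlisted" >&2
        return 1
    fi

    # A newline inside an argument would smuggle in a second command.
    newline='
'
    case "$*" in
        *"$newline"*)
            echo "refused: newline in argument" >&2
            return 1
            ;;
    esac

    line="[$(date +%s)] $name"
    for arg in "$@"; do
        line="$line;$arg"
    done
    printf '%s\n' "$line" >>"$pipe"
}
```

The newline check matters because each line written to the pipe is treated as a separate command.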

The syntax for formatting commands is easy. Each command must be placed on a single line and end with a newline character. The syntax is as follows:

[TIMESTAMP] COMMAND_NAME;argument1;argument2;...;argumentN

TIMESTAMP is written as Unix time -- that is, the number of seconds since 1970-01-01 00:00:00. You can create this by using the date command. Most programming languages also offer the means to get the current Unix time.
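The line format can be captured in a small shell helper. This is a sketch only: the name format_command and the NOW override variable are assumptions for illustration, and the helper prints the line instead of writing to the pipe so that it can be inspected first.

```shell
#!/bin/sh
# format_command NAME [ARG...]   (hypothetical helper)
# Prints one external-command line: "[unix_time] NAME;arg1;...;argN".
# Set NOW to pin the timestamp; otherwise the current time is used.
format_command() {
    ts=${NOW:-$(date +%s)}
    line="[$ts] $1"
    shift

    # Append each argument, separated by semicolons.
    for arg in "$@"; do
        line="$line;$arg"
    done

    # printf guarantees exactly one trailing newline.
    printf '%s\n' "$line"
}
```

Redirecting the output of the helper to the command pipe submits the command.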

Commands are written in upper case. The arguments depend on the actual command. For example, to add a comment to a host stating that it has passed a security audit, you can use the following shell command:

echo "[`date +%s`] ADD_HOST_COMMENT;somehost;1;Security Audit; This host has passed security audit on `date +%Y-%m-%d`" >/var/nagios/rw/nagios.cmd

This will send an ADD_HOST_COMMENT command to Nagios over the external command pipe. Nagios will then add a comment to the host, somehost, stating that the comment originated from Security Audit. The first argument specifies the host name to add the comment to; the second tells Nagios if this comment should be persistent. The next argument describes the author of the comment, and the last argument specifies the actual comment text.

Similarly, adding a comment to a service requires the use of the ADD_SVC_COMMENT command. The command's syntax is similar to that of the ADD_HOST_COMMENT command except that the command requires the specification of the host name and service name.

You can also delete a single comment or all comments using the DEL_HOST_COMMENT, DEL_ALL_HOST_COMMENTS, and DEL_SVC_COMMENT or DEL_ALL_SVC_COMMENTS commands.

Other commands worth mentioning are related to scheduling checks on demand. Often, it is necessary to request that a check be carried out as soon as possible; for example, when testing a solution.

You can create a script that schedules a check of a host, all services on that host, and a service on a different host, as follows:

#!/bin/sh
NOW=`date +%s`

echo "[$NOW] SCHEDULE_HOST_CHECK;somehost;$NOW" \
  >/var/nagios/rw/nagios.cmd

echo "[$NOW] SCHEDULE_HOST_SVC_CHECKS;somehost;$NOW" \
  >/var/nagios/rw/nagios.cmd

echo "[$NOW] SCHEDULE_SVC_CHECK;otherhost;Service Name;$NOW" \
  >/var/nagios/rw/nagios.cmd

exit 0

The commands SCHEDULE_HOST_CHECK and SCHEDULE_HOST_SVC_CHECKS accept a host name and the time at which the check should be scheduled. The SCHEDULE_SVC_CHECK command requires the specification of a service description as well as the name of the host to schedule the check on.

Normal scheduled checks, such as the ones scheduled above, might not actually take place at the time that you scheduled them. Nagios also needs to take allowed time periods into account as well as checking whether checks were disabled for a particular object or globally for the entire Nagios.

There are cases when you'll need to force Nagios to do a check -- in such cases, you should use the SCHEDULE_FORCED_HOST_CHECK, SCHEDULE_FORCED_HOST_SVC_CHECKS, and SCHEDULE_FORCED_SVC_CHECK commands. They work in exactly the same way as described above, but make Nagios skip the checking of time periods and of whether checks are disabled for this particular object. This way, a check will always be performed, regardless of other Nagios parameters.

Other commands worth using are related to custom variables, introduced in Nagios 3. When you define a custom variable for a host, service, or contact, you can change its value on the fly with the external command pipe.

As these variables can then be directly used by check or notification commands and event handlers, it is possible to make other applications or event handlers change these attributes directly without modifications to the configuration files.

How might this work? Suppose that the IT staff registers its presence via an application without any GUI. This application periodically sends information about the latest known IP address, and that information is then passed to Nagios assuming that the person is in the office. This would later be sent to a notification command to use that specific IP address while sending a message to the user.

Assuming that the user name is jdoe and the custom variable name is DESKTOPIP, the message that would be sent to the Nagios external command pipe would be as follows:

[1206096000] CHANGE_CUSTOM_CONTACT_VAR;jdoe;DESKTOPIP;12.34.56.78

This would cause a subsequent use of $_CONTACTDESKTOPIP$ to return a value of 12.34.56.78.

Nagios offers the CHANGE_CUSTOM_CONTACT_VAR, CHANGE_CUSTOM_HOST_VAR, and CHANGE_CUSTOM_SVC_VAR commands for modifying custom variables in contacts, hosts, and services.

The commands explained above are just a small subset of the full capabilities of the Nagios external command pipe. For a complete list of commands, visit the External Command List.

Using Nagios to Monitor Networks

Posted by philcore on Mon 28 Nov 2005 at 12:23

Nagios is a powerful, modular network monitoring system that can be used to monitor many network services like smtp, http and dns on remote hosts. It also has support for snmp to allow you to check things like processor loads on routers and servers. I couldn't begin to cover all of the things that nagios can do in this article, so I'll just cover the basics to get you up and running.

apt-get install nagios-text
First we need to define people that will be notified, and define how they should be notified. In the example below, I define two users, joe and paul. Joe is the network guru and cares about routers and switches. Paul is the systems guy, and he cares about servers. Both will be notified via email and by pager. Note that if you are going to monitor your email server, you will want to use another notification method besides email. If your email server is down, you can't send anybody an email to notify them! :) In that case you will want to use a pager server to send a text message to a phone or pager, or set up a second nagios monitor that uses a different mail server to send email.

Edit /etc/nagios/contacts.cfg and add the following users:

define contact{
    contact_name                    joe
    alias                           Joe Blow
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,u,r
    service_notification_commands   notify-by-email,notify-by-epager
    host_notification_commands      host-notify-by-email,host-notify-by-epager
    email                           joe@yourdomain.com
    pager                           5555555@pager.yourdomain.com
    }

define contact{
    contact_name                    paul
    alias                           Paul Shiznit
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,u,r
    service_notification_commands   notify-by-email,notify-by-epager
    host_notification_commands      host-notify-by-email,host-notify-by-epager
    email                           paul@yourdomain.com
    pager                           5556666@pager.yourdomain.com
    }
Now add the users to groups.
In /etc/nagios/contactgroups.cfg add the following:
define contactgroup{
    contactgroup_name   router_admin
    alias               Network Administrators
    members             joe
}

define contactgroup{
    contactgroup_name   server_admin
    alias               Systems Administrators
    members             paul
}
You can add multiple members to a contact group by listing comma separated users.

Now to define some hosts to monitor. For my example, I define two machines, a mail server and a router.

Edit /etc/nagios/hosts.cfg and add:

define host{
    use                     generic-host
    host_name               gw1.yourdomain.com
    alias                   Gateway Router
    address                 10.0.0.1
    check_command           check-host-alive
    max_check_attempts      20
    notification_interval   240
    notification_period     24x7
    notification_options    d,u,r
    }

define host{
    use                     generic-host
    host_name               mail.yourdomain.com
    alias                   Mail Server
    address                 10.0.0.100
    check_command           check-host-alive
    max_check_attempts      20
    notification_interval   240
    notification_period     24x7
    notification_options    d,u,r
    }
Now we add the hosts to groups. I define groups called 'routers' and 'servers' and add the router and mail server respectively.

Edit /etc/nagios/hostgroups.cfg

define hostgroup{
    hostgroup_name  routers
    alias           Routers
    contact_groups  router_admin
    members         gw1.yourdomain.com
    }

define hostgroup{
    hostgroup_name  servers
    alias           Servers
    contact_groups  server_admin
    members         mail.yourdomain.com
    }
Again, for multiple members, just use a comma separated list of hosts.

Next define services to monitor on each of the hosts. Nagios has many built-in plugins for monitoring. On a debian sarge system, they are stored in /usr/lib/nagios/plugins. Here we want to monitor the smtp service on the mail server, and do ping checks on the router.

Edit /etc/nagios/services.cfg

define service{
    use                     generic-service 
    host_name               mail.yourdomain.com
    service_description     SMTP
    is_volatile             0
    check_period            24x7
    max_check_attempts      3
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          server_admin
    notification_interval   240
    notification_period     24x7
    notification_options    w,u,c,r
    check_command           check_smtp
    }

define service{
    use                     generic-service 
    host_name               gw1.yourdomain.com
    service_description     PING
    is_volatile             0
    check_period            24x7
    max_check_attempts      3
    normal_check_interval   5
    retry_check_interval    1
    contact_groups          router_admin
    notification_interval   240
    notification_period     24x7
    notification_options    w,u,c,r
    check_command           check_ping!100.0,20%!500.0,60%
    }
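In the check_command value above, everything after the command name is an argument list separated by exclamation marks; Nagios passes these to the command definition as $ARG1$, $ARG2$ and so on. In the stock check_ping definition they are the warning and critical thresholds, each given as round-trip time in milliseconds and packet loss. The split can be sketched in shell as follows (split_check_command is a hypothetical name used only for illustration):

```shell
#!/bin/sh
# split_check_command VALUE   (hypothetical helper)
# Prints the command name and each "!"-separated argument of a
# check_command value such as "check_ping!100.0,20%!500.0,60%".
split_check_command() {
    rest=$1

    # Everything before the first "!" is the command name.
    echo "command=${rest%%!*}"
    case "$rest" in
        *'!'*) rest=${rest#*!} ;;
        *) return 0 ;;
    esac

    # Peel off one argument per iteration.
    n=1
    while :; do
        echo "ARG$n=${rest%%!*}"
        case "$rest" in
            *'!'*) rest=${rest#*!}; n=$((n + 1)) ;;
            *) break ;;
        esac
    done
}
```

For the PING service above this prints command=check_ping, ARG1=100.0,20% and ARG2=500.0,60%.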
And that's it. To test your configurations, you can run
nagios -v /etc/nagios/nagios.cfg
If all is well we can restart nagios and move on to the apache side to get a visual view of the monitor.
/etc/init.d/nagios restart
Assuming you have a working apache install, you can add the apache.conf file included in the nagios package to set up the nagios cgi administration interface. The web interface is not required to run nagios, but it is definitely worth setting it up. The simplest way to get it up and running is to copy the supplied conf file over to our apache installation. On my system, I'm running apache2. Systems running apache 1.3.xx will have slightly different setups.
cp /etc/nagios/apache.conf /etc/apache2/sites-enabled/nagios
Of course you may want to set it up as a virtual server, but I leave that as an exercise for the reader. Now you will want to set up an allowed user to view the cgi interface. By default, nagios grants full administrative access to the nagiosadmin user. Nagios uses apache htpasswd style authentication, so here we add the user nagiosadmin with password mypassword to the default nagios htpasswd file.
htpasswd2 -nb nagiosadmin mypassword >> /etc/nagios/htpasswd.users
You should now be able to restart apache and log on to

http://your.nagios.server/nagios

Nagios is a very powerful tool for monitoring networks. I've only touched on the basics here, but it should be enough to get you up and running. Hopefully, once you do, you'll start experimenting with all the cool features and plugins that are available. The documentation included in the cgi interface is very detailed and helpful.

[Sep 11, 2008] Custom checks and notifications for Nagios

The author uses Perl for the plug-in
“Howto be a good (and lazy) System Administrator.” A couple astute readers, after reading the article, asked if I was familiar with the Nagios monitoring system, and I am. I've been using Nagios for a few years now.

I had intended to write this article as a How-to on getting Nagios configured and running for the first time. However, it turns out that the documentation that comes with Nagios is really pretty good. And even if you do have problems, and I did, the user community is also quite responsive. So, rather than beating a dead horse, (with sympathy to horse lovers) I decided to continue the Good and Lazy Administrator Theme and discuss extending Nagios with custom service checks and custom notifications.

Nagios uses a plug-in mechanism to implement all of its server and service checks as well as all of its notifications. This is good news for hackers, as it allows us to build new functionality that either no one else has thought of, or has need of. I wrote a couple of scripts for my Nagios system. One does a custom service check to see if I have voicemail waiting for me at the Help Desk, and the other does a custom notification by telephone. Before I go on, I should give a little bit of background.
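To make the plug-in mechanism concrete: a check is just an executable that prints a status line and returns an exit code that Nagios maps to a state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following is a minimal sketch in Python, not the author's Perl script; the voicemail count, the thresholds and the function names are hypothetical.

```python
"""Minimal sketch of a Nagios check plug-in (illustrative only)."""

# Exit codes defined by the Nagios plug-in convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(waiting, warn=1, crit=5):
    """Map a count of waiting messages to (exit_code, status_line).

    `waiting`, `warn` and `crit` are hypothetical inputs for this sketch.
    """
    if waiting is None or waiting < 0:
        return UNKNOWN, "VOICEMAIL UNKNOWN - could not read message count"
    if waiting >= crit:
        code = CRITICAL
    elif waiting >= warn:
        code = WARNING
    else:
        code = OK
    # Text after the pipe is optional performance data (label=value;warn;crit).
    line = "VOICEMAIL %s - %d message(s) waiting | waiting=%d;%d;%d" % (
        LABELS[code], waiting, waiting, warn, crit)
    return code, line


def main(argv):
    """Print the status line and return the exit code for Nagios."""
    # A real plug-in would query the voicemail system; here the count is
    # taken from the command line so the sketch stays self-contained.
    try:
        waiting = int(argv[1])
    except (IndexError, ValueError):
        waiting = None
    code, line = evaluate(waiting)
    print(line)
    return code

# As a plug-in entry point, finish the script with: sys.exit(main(sys.argv))
```

Keeping the threshold logic in a pure function such as evaluate() makes the check easy to test without a running Nagios.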

[Sep 10, 2008] check_logfiles

Perl plugin: check_logfiles is a plugin for Nagios which checks logfiles for defined patterns

check_logfiles is a plugin for Nagios which checks logfiles for defined patterns. It is capable of detecting logfile rotation. If you tell it how the rotated archives look, it will also examine these files. Unlike check_logfiles, traditional logfile plugins were not aware of the gap which could occur, so under some circumstances they ignored what had happened between their checks. A configuration file is used to specify where to search, what to search, and what to do if a matching line is found.

[Sep 2, 2008] nagstamon 0.5.10 by Nagiostray

About: Nagstamon is a Nagios status monitor with a UI that resides in the GNOME systray or on the Windows desktop. It informs you in realtime about the status of your Nagios monitored network.

Changes: This release fixes a problem with passwords containing special characters, and an issue where it omitted showing failed services on hosts in scheduled downtime.

[Jun 25, 2008] check_oracle_health

About: check_oracle_health is a plugin for the Nagios monitoring software that allows you to monitor various metrics of an Oracle database. It includes connection time, SGA data buffer hit ratio, SGA library cache hit ratio, SGA dictionary cache hit ratio, SGA shared pool free, PGA in memory sort ratio, tablespace usage, tablespace fragmentation, tablespace I/O balance, invalid objects, and many more.

Release focus: Major feature enhancements

Changes: The tablespace-usage mode now takes into account when tablespaces use autoextents. The data-buffer/library/dictionary-cache-hitratio are now more accurate. Sqlplus can now be used instead of DBD::Oracle.

[Jun 11, 2008] check_lm_sensors 3.1.0  by Matteo Corti

About: check_lm_sensors is a Nagios plugin to monitor the values of on-board sensors and hard disk temperatures on Linux systems.

Changes: The plugin now uses the standard Nagios::Plugin CPAN classes, fixing issues with embedded perl.

[May 6, 2008] check_logfiles

Perl plugin: check_logfiles is a plugin for Nagios which checks logfiles for defined patterns

check_logfiles 2.3.3 (Default)
Added: Sun, Mar 12th 2006 15:09 PDT (2 years, 1 month ago)
Updated: Tue, May 6th 2008 10:37 PDT (today)

[Aug 10, 2007] {Book} Building a Monitoring Infrastructure with Nagios by David Josephsen

A short, superficial intro book (190 pages). Killer phrase from the review below: "This is the book you should pass to your manager so (s)he understands why and how an open solution like Nagios is the better choice and can be used for achieving surpassing solutions. "
Warning: Several reviews of this book look like plants: they are written by readers who have only a single networking book review or just a single review.

Spot on for a well structured book with many WOW-factors,

May 17, 2007 By Nils Valentin (Tokyo, Japan)
--- DISCLAIMER: This is a requested review by PTR, however any opinions expressed within the review are my personal ones. ---

Introduction - 6p

CHAPTER 1 Best Practices - 12p
CHAPTER 2 Theory of Operations - 26p
CHAPTER 3 Installing Nagios - 11p
CHAPTER 4 Configuring Nagios - 23p
CHAPTER 5 Bootstrapping the Configs - 10p
CHAPTER 6 Watching - 46p
CHAPTER 7 Visualization - 42p
CHAPTER 8 Nagios Event Broker Interface - 19p
APPENDIX A Configure Options - 3p
APPENDIX B nagios.cfg and cgi.cfg - 9p
APPENDIX C Command-Line Options - 10p
Index - 14p

At 190 pages (230 including the appendices and index), the book is very compact. It teaches you Nagios in a way I have never heard or read before. I must assume that the author's clearly structured style, which runs through the book like a red thread, is responsible for the excellent outcome.

The book's introduction opens with the title "Do it right the first time", and that hits the spot. What distinguishes this little portable knowledge base is its exceptionally well thought-out content and the author's explanations. David is not filling pages by explaining each and every parameter, but rather showing you the big picture and explaining how to approach new issues or why one technical solution is better than another.

This is the book you should pass to your manager so (s)he understands why and how an open solution like Nagios is the better choice and can be used for achieving surpassing solutions.

The book itself basically is divided in two sections:

Background, setup and configuration - Chapters 1-5
Advanced Topics - Chapters 6-8

I found all of the chapters to have a nice balance in the amount of information needed, but some EXCEPTIONALLY good parts of the book were:

Chapter 1 Best practices
Chapter 2 - the part about scheduling
Chapters 6-8 as a whole

Chapter 6 has thorough explanations of monitoring the different OSes (especially the Windows part !!) and other applications.

Chapter 7 stands out for its overall thoroughness on how to visualize your data to reach a better understanding of the systems / network you are monitoring.

Chapter 8 describes a filesystem-based status interface. The NEB module will write a file with its current status code for each service. I have to admit that some technical details went over my head, but I thought that was pretty cool !!


The points featured above are what I found to be exceptionally good, and they are most likely the strongest selling points of this little portable knowledge base. That doesn't mean that the parts of the book not mentioned are weak, mind you.

Funnily enough, the points mentioned above were EXACTLY the ones I hadn't seen explained this thoroughly anywhere before.

So David's book was exactly spot on for me.

Summary:

To sum it all up in very simple words: This is a hell of a book !!

It's the most compact, well structured book on Nagios that I have seen to date. It contains many WOW-factors. While reading each chapter you can virtually "feel" how David's explanations, tips and tricks are already helping you to avoid time-consuming pitfalls.

So this book is not about "to buy or not to buy"; this is an investment you don't want to miss !!

I was especially impressed by the thoroughness with which the book is written from the first page. Although the contents of the first chapter weren't new to me, the way they were explained already provided many of those A-ha moments.

The main asset of the book is not the description of the tools themselves, but rather the thought and consideration the author put into it and the sharing of those thoughts in a way that lets the reader visualize how and why one solution is better than another, without having the "luxury of experiencing the pitfalls" in a live disaster scenario.


PS: AFTER I finished reading the book I re-read the "Editorial Review" Amazon gave above and found it pretty well describing the actual book and what you should expect.

>> You can find more reviews on Nagios related books including a comparison by deploying my profile. <<

Nagios Looking Glass Getting started

With the Nagios Looking Glass (NLG) tool, developer Andy Shellam has tried to resolve a common problem for network administrators running Nagios. What happens if you want to provide access to up-to-date information from Nagios without giving users access to the full Nagios console? Providing read-only access to the Nagios console can be complicated, and can occasionally require network re-structuring or can even pose a security risk.

NLG is designed to fix those issues by taking a feed from Nagios status data via an HTTP connection and displaying it on a public Web server. It works in a client-server model with a PHP-based polling server installed on your Nagios server. A receiver client, also PHP-based, is installed on your Web server. If you want to use NLG locally, you can also run the client and the server together on your Nagios server. The receiver client creates an AJAX-enabled page based on a template. You can also customize this template to display whatever you require.

You can see a demo of NLG at http://looking-glass.andyshellam.eu/demo/.

[Jul 18, 2007] Nagios Looking Glass Getting started by James Turnbull

[Jul 18, 2007] Using modules in Nagios Event Broker by James Turnbull

"The most well-known NEB module is the NDO Utilities module. The NDO Utilities module is written by Nagios' developer, Ethan Galstad, and is designed to output events and data from Nagios to a standard file or a Unix socket. "

01.29.2007 | searchenterpriselinux.techtarget.com

The Nagios enterprise monitoring tool generates a variety of events. The principal events generated are the results of monitoring applications, databases, devices, services and hosts. Also generated is performance data and notification events such as outages and downtime. There are a number of ways to integrate and utilize these events. The most advanced and effective event integration mechanism is the Nagios Event Broker (NEB).

NEB uses callback routines that are executed when events occur in the Nagios server. Using NEB you can write broker modules that can process these events. NEB allows you to output and integrate events into a variety of tools including MySQL databases, SNMP traps, syslog messages or use the event data in a variety of other applications and tools.

Nagios Event Broker functions and triggers

NEB uses shared code libraries called modules that are hooked into the Nagios server when it is executed. Each module can register callback procedures that are able to receive and process events. When an event occurs, NEB checks for the presence of a registered callback and, if detected, sends the event to the module. The module receives the event and performs whatever actions are coded into it.
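The register-then-dispatch flow described above can be modelled in a few lines. This is only a toy model in Python, not the NEB API: real modules are compiled shared objects that register C callbacks with the broker, and the class name, method names and event data below are hypothetical.

```python
from collections import defaultdict


class Broker:
    """Toy stand-in for the event broker: a registry of callbacks per event type."""

    def __init__(self):
        self._callbacks = defaultdict(list)

    def register(self, event_type, callback):
        """Remember that `callback` wants to receive events of `event_type`."""
        self._callbacks[event_type].append(callback)

    def dispatch(self, event_type, data):
        """Send the event to every registered callback; return how many ran."""
        handlers = self._callbacks.get(event_type, [])
        for callback in handlers:
            callback(event_type, data)
        return len(handlers)


# A "module" is simply code that registers a callback and acts on events.
seen = []
broker = Broker()
broker.register("service_check", lambda kind, data: seen.append((kind, data["state"])))

broker.dispatch("service_check", {"host": "foo", "service": "bar", "state": 2})
broker.dispatch("notification", {"host": "foo"})  # nobody registered: ignored
```

Events with no registered callback are dropped, which mirrors the check for a registered callback described above.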

The broker can process a large number of events including, amongst others:

You can see a full list of the callbacks in the nebcallbacks.h include file located in the include directory of the Nagios source package.

Enabling Nagios Event Broker

NEB should be enabled by default when you compile Nagios (unless you disable it). If you want to ensure that NEB gets compiled, specify the --enable-event-broker configure option when configuring Nagios.

# ./configure --enable-event-broker

Registering modules with Nagios Event Broker

Modules are included into the Nagios configuration by using broker_module configuration options in the nagios.cfg configuration file. For example:

broker_module=/usr/local/nagios/bin/testmodule.o 

This line would load a module called testmodule.o located in the /usr/local/nagios/bin directory. You can also specify a configuration file for a module like so:

broker_module=/usr/local/nagios/bin/testmodule.o config_file=/usr/local/nagios/etc/testmodule.cfg

You need to restart Nagios for any newly defined modules to take effect.

Writing modules for Nagios Event Broker

NEB Modules can be written in C or C++. You can see an example of a module in the Nagios package. Located in the module directory off the root of the package is the helloworld module. You can create it by compiling the helloworld.c file.

# gcc -shared -o helloworld.o helloworld.c

You can then add this module to Nagios using the broker_module directive in the nagios.cfg configuration file. Restart Nagios and the module is now loaded.

The Helloworld module is extremely simple. Helloworld logs a message to the default Nagios log file when Nagios is started and stopped and when aggregated status updates start and finish. The message looks like:

[1137151111] helloworld: An aggregated status update just started.
[1137151112] helloworld: An aggregated status update just finished.

You can review the contents of this module (which includes some basic inline documentation) in helloworld.c.

Available modules for Nagios Event Broker

There are not a lot of NEB modules available so far. The most well-known NEB module is the NDO Utilities module. The NDO Utilities module is written by Nagios' developer, Ethan Galstad, and is designed to output events and data from Nagios to a standard file or a Unix socket. It also comes with a module, NDO2DB, that can write Nagios data to a MySQL or PostgreSQL database. It should provide (together with the helloworld module) a good introduction to NEB and help you get started on writing your own modules.

You can also find the following NEB modules:

  • NEB module that logs to a socket based on client requests
  • A NEB module (as yet unreleased) that does event correlation with Nagios and SEC.
  • A NEB module that helps integrate Cacti with Nagios.

    Further help with Nagios Event Broker

    There is not a lot of documentation available for NEB thus far. The only major piece of documentation available is about the NEB API. You can also review the Nagios source code relevant to NEB, particularly the include files.

    As always the Nagios development and user mailing lists are good starting places for assistance.

    [Apr 12, 2007] Sys Admin Taming Nagios by David Josephsen

    Dec 05, 2005 ( Sys Admin)

    In the past few years, Nagios has become the industry standard open source systems monitoring tool. If you're using an open source app to monitor the availability, state, or utilization of your servers or network gear, then chances are you are using Nagios to do it. To those who have worked with it, this is no surprise. The lightweight design of Nagios offloads the actual query logic into "plug-ins", which are easily created, modified, and re-purposed by sys admins. The lack of complex query logic leaves the Nagios daemon free to manage scheduling and notifications and to handle UI.

    Nagios's "keep it simple" approach makes it straightforward to administer, network transparent, and amazingly flexible.

    Two excellent articles by Syed Ali in previous editions of Sys Admin covered the installation and configuration of Nagios. In this article, I'll pick up where those articles left off and provide some creative solutions to problems commonly faced by sys admins working with Nagios to monitor the health and performance of systems.

    [Apr 10, 2007] Nagios network monitoring felled by SNMP false alarms By Jack Loftus

    It is still unclear why false alerts were generated. Is this just a plug for Hyperic?
    There was nothing technically wrong with the HP ProLiant servers at Mynewplace.com, an online rental services agency based in San Francisco, but the IT staff kept on getting beeped at 4 a.m. with alerts that eventually proved to be false alarms.

    So while the servers were fine, the IT staff wasn't. Entire days were being wasted each month diagnosing their clutch of 50 HP ProLiant DL145s and DL385s running Red Hat Enterprise Linux 4 AS and ES, said John Shin, Mynewplace.com's director of systems. Shin decided he needed to make some changes.

    Struggling with network monitoring

    "We were struggling with monitoring," Shin said, but that may have been an understatement. Things were so bad, in fact, that at one point last year he contemplated disabling the monitoring application altogether because it was doing more harm than good.

    The application was Nagios, a popular open source systems and network monitoring application that provides alerts for user-defined hosts and services. In Shin's network, however, it was triggering false alarms because of simple network management protocol [SNMP] incompatibilities with Mynewplace.com's open source application server, Resin 3.0. Resin is based on a Java implementation of the PHP scripting language and is maintained and supported by San Diego-based Caucho Technology Inc.

    Nagios, JVM and Resin 3.0 woes

    Since Resin and Nagios were not directly compatible, Shin would expose the application stack's Java virtual machines (JVMs) through SNMP and monitor the environment that way. Unfortunately, response times under those conditions were sluggish, he said.

    "Nagios was not really the problem," Shin said. "It was the JVM stack not being able to respond to it correctly. It was recording events in SNMP that were then watched by Nagios and that made things crawl. There were a lot of man hours wasted, and it would trigger the 4 a.m. pages."

    In spite of its popularity on open source repositories like SourceForge.net, Nagios has its detractors. In a recent interview about Nagios with SearchEnterpriseLinux.com, Zenoss Inc. CEO Bill Karpovich criticized Nagios for its lack of enterprise-level support. "The maintainers never thought of it as a project that an IT manager would use to monitor an entire enterprise environment," he said. Zenoss is an open source startup vendor in the systems management space.

    ... ... ...

    The feature-rich, expensive offerings from HP and the other members of the "big four" – IBM, CA and BMC – have spawned the "little four" (a phrase coined by analyst firm RedMonk), comprised of Hyperic, Zenoss, Qlusters and GroundWork. Executives from those companies have bet their chips on the valuable midmarket for customer wins like Mynewplace.com.

    Compared with OpenView, offerings from the "little four" were priced approximately two-and-a-half times less on average, Shin found, although he would not cite specific dollar amounts. OpenView had another strike against it: "It did not have the framework in place to monitor some of our key applications," namely Resin and Postgres, Shin said.

    Linux Today - Linux.com Complex Service Checks with Nagios

    [Feb 05, 2007] Looking ahead to Nagios 3.0 by James Turnbull

    Nagios is a free, open source enterprise monitoring tool designed to run on Linux. It has extensive monitoring and management capabilities that allow you to check applications, databases and network devices, as well as Windows and Unix/Linux hosts and services. It is easy to install, fast to configure and highly customizable.

    Nagios also comes with a Web-based console, an extensible Nagios Event Broker (NEB) that allows you to integrate Nagios with other tools, like database back-ends, and a large collection of monitoring commands and capabilities. Its current release, version 2.0, is stable and production ready. You can take a look at Nagios at http://www.nagios.com.

    Development of Nagios has not stopped with version 2.0, though. Nagios' principal developer, Ethan Galstad, has recently released some information on the status and potential features of the next release, version 3.0. Galstad's announcement also suggests an alpha release of version 3.0 could be scheduled as early as the end of February 2007.

    Features: What's new in Nagios 3.0

    So what's new with version 3.0? Well, a lot. Let's walk through the major new features and look at how some of Nagios' old features have been expanded or changed.

    One of the interesting features introduced in Nagios 2.0 was adaptive monitoring. Adaptive monitoring allowed a Nagios configuration to be changed during runtime. For example, you can change the command being used to check a host, based on changing conditions in your environment. In the new version, this functionality is expanded to include the ability to change the times during which checks are scheduled to occur. This allows you to turn on/off checks at specific times according to conditions in your environment.

    Notifications have also been enhanced, now allowing a delay to be added to first notifications. Notifications can be generated when flapping is disabled and, most importantly, notifications can now be sent out when a scheduled downtime starts, ends or is cancelled.

    Objects and templates haven't been forgotten either. One particularly useful change is the ability to use multiple templates for objects. Another is the addition of custom variables in host, service and contact objects. Version 2.0 only allows the application of one template to an object. Multiple templates offer greater flexibility and power, which will make a significant difference to the configuration of objects.

    Custom variables allow you to define your own directives in object definitions and, therefore, attach additional information about an object to its definition. These variables can be retrieved and used elsewhere in your Nagios environment. For instance, you could define the SNMP community strings for a host in its definition and then use these later in a check or external command.
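As a sketch of how the SNMP example might look, the Python snippet below derives the macro name from a custom variable and builds a check command around it. It assumes the Nagios 3 conventions that custom variable names start with an underscore and are upper-cased in macro names; the host definition, the generic-host template, the address and the check_snmp command line are hypothetical illustrations.

```python
def custom_macro(object_type, variable):
    """Return the macro name Nagios 3 derives from a custom variable."""
    prefixes = {"host": "_HOST", "service": "_SERVICE", "contact": "_CONTACT"}
    if not variable.startswith("_"):
        raise ValueError("custom variable names must start with an underscore")
    # Names are upper-cased and the leading underscore is folded into the prefix.
    return "$%s%s$" % (prefixes[object_type], variable[1:].upper())


# Hypothetical host definition carrying an SNMP community string.
HOST_DEFINITION = """\
define host {
        use                     generic-host
        host_name               switch01
        address                 192.0.2.10
        _snmp_community         public
}
"""

# A check command could then reference the value through the macro.
CHECK_COMMAND = "$USER1$/check_snmp -H $HOSTADDRESS$ -C %s -o sysUpTime.0" % (
    custom_macro("host", "_snmp_community"))
```

The community string then lives in one place, the host definition, instead of being repeated in every command that needs it.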

    Other object and template changes include: merging service and host-extended information object data into service and host object definitions, and adding group member directives to the host and service group objects.

    Enhancements to external commands are also present, including the ability to process commands found in an external file. The suggested use of this functionality is for passive checks with long output or complicated scripting. A further change in Nagios 3.0 is that external command checking is now turned on by default. In previous versions, such checking was turned off by default.

    Host and service logic alterations have also been made. Most notably, host checks now run asynchronously in parallel with each other. This should help balance overall check performance. Another enhancement is the ability to cache host and service check results and a function to enable the predictive checking of dependent hosts and services.

    The ability to output multiple lines of data from host and service checks has also been added. Previously, Nagios 2.0 was limited to a single line of output from checks, thus reducing the utility of some checks. Now, multiple lines can be received and processed by Nagios and the size of plug-in output has been correspondingly increased to 2 KB.
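A plug-in that uses multi-line output might assemble its text as in the sketch below. It assumes the Nagios 3 layout in which the first line carries the summary and, after a pipe, optional performance data, with further detail lines following; the function name and the sample disk values are hypothetical.

```python
def format_plugin_output(summary, details=(), perfdata=()):
    """Build multi-line plug-in output: a summary line, then detail lines.

    Performance data items are joined with spaces and placed after a pipe on
    the first line. Detail lines should not contain a pipe themselves, since
    Nagios would read what follows it as performance data.
    """
    first = summary
    if perfdata:
        first += " | " + " ".join(perfdata)
    return "\n".join([first] + list(details))


text = format_plugin_output(
    "DISK WARNING - 2 filesystems above 80%",
    details=["/var 85%", "/home 91%"],
    perfdata=["var=85%;80;95", "home=91%;80;95"],
)
print(text)
```

Under Nagios 2.0 only the first line of such output would be kept, so the summary should be useful on its own.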

    A number of performance optimizations have been included in Nagios 3.0, as well as enhancements to the Nagios Event Broker and the embedded Perl interpreter. Also worth mentioning are updates to macros and to status, comment and retention data.

    To see a full list of the changes, or if you wish to try Nagios 3.0 before its alpha release, you can download a current CVS snapshot from http://www.nagios.org/development/cvs.php . The Changelog file contained in the snapshot provides a reasonably full list of the proposed changes.

    [Feb 06, 2007] SearchOpenSource: Nagios Looking Glass: Getting Started


    Recommended Links


    Articles


    NSCA Daemon

    HowToContactNagios - Munin - Trac

    Munin integrates perfectly with Nagios. There are, however, a few things of which to take notice. This article shows example configurations and explains the communication between the systems.

    Receiving messages in Nagios

    First you need a way for Nagios to accept messages from Munin. Nagios has exactly such a thing, namely the NSCA which is documented here: http://nagios.sourceforge.net/docs/1_0/addons.html#nsca.

    NSCA consists of a client (a binary usually named send_nsca) and a server, usually run from inetd. We recommend that you enable encryption on NSCA communication.

    You also need to configure Nagios to accept messages via NSCA. NSCA is, unfortunately, not very well documented in Nagios' official documentation. We'll cover writing the needed service check configuration further down in this document.

    Configuring Nagios

    In the main config file, make sure that the command_file directive is set and that it works. See http://nagios.sourceforge.net/docs/2_0/configmain.html#command_file for details.

    Below is a sample extract from nagios.cfg:

    command_file=/var/run/nagios/nagios.cmd
    

    The /var/run/nagios directory is owned by the user nagios runs as. The nagios.cmd is a named pipe on which Nagios accepts external input.
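As an illustration of what "external input" on that pipe looks like, the sketch below formats a passive service check result with the standard PROCESS_SERVICE_CHECK_RESULT external command and writes it to the command file. The function names are hypothetical, and the default path simply repeats the sample nagios.cfg value above.

```python
import time


def format_passive_result(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    # Semicolons separate the fields, and the command must fit on one line.
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output.replace("\n", " "))


def submit(command_line, command_file="/var/run/nagios/nagios.cmd"):
    """Write the command to the named pipe given by command_file."""
    # Opening a FIFO for writing blocks until Nagios has it open for reading.
    with open(command_file, "w") as pipe:
        pipe.write(command_line)
```

For example, submit(format_passive_result("foo", "bar", 0, "all good")) would report an OK state for service bar on host foo, provided the calling user may write to the pipe.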

    Configuring NSCA, server side

    NSCA is run through (x)inetd. Using inetd, the below line enables NSCA listening on port 5667:

    5667            stream  tcp     nowait  nagios  /usr/sbin/tcpd  /usr/sbin/nsca -c /etc/nsca.cfg --inetd
    

    Using xinetd, the block below enables NSCA listening on port 5667, allowing connections only from the local host:

    # description: NSCA (Nagios Service Check Acceptor)
    service nsca
    {
     flags           = REUSE
     type            = UNLISTED
     port            = 5667
     socket_type     = stream
     wait            = no
    
     server          = /usr/sbin/nsca
     server_args     = -c /etc/nagios/nsca.cfg --inetd
     user            = nagios
     group           = nagios
    
     log_on_failure  += USERID
    
     only_from       = 127.0.0.1
    }
    

    The file /etc/nsca.cfg defines how NSCA behaves (the xinetd example above reads /etc/nagios/nsca.cfg instead; use whichever path you pass to -c). Check in particular the nsca_user and command_file directives; these should correspond to the file permissions and the location of the named pipe described in nagios.cfg.

    nsca_user=nagios
    command_file=/var/run/nagios/nagios.cmd
    

    Configuring NSCA, client side

    The NSCA client is a binary that submits to an NSCA server whatever it received as arguments. Its behaviour is controlled by the file /etc/send_nsca.cfg, which mainly controls encryption.

    You should now be able to test the communication between the NSCA client and the NSCA server, and consequently whether Nagios picks up the message. NSCA requires a defined format for messages. For service checks, it's like this: <host_name>[tab]<svc_description>[tab]<return_code>[tab]<plugin_output>[newline]

    The following shows how to test NSCA:

    $ /usr/sbin/send_nsca -H localhost -c /etc/send_nsca.cfg
    foo.example.com  test   0       0
    1 data packet(s) sent to host successfully.
    

    This caused the following to appear in /var/log/nagios/nagios.log:

    [1159868622] Warning:  Message queue contained results for service 'test' on host 'foo.example.com'.  The service could not be found!
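The same test can be scripted. The sketch below builds a message in the tab-separated format given above and pipes it to send_nsca with the same -H and -c options used in the manual test; the function names are hypothetical, and the binary and configuration paths are assumptions that may differ on your system.

```python
import subprocess


def nsca_message(host, service, return_code, output):
    """Return one service-check line in the tab-separated NSCA format."""
    fields = [host, service, str(return_code), output]
    for field in fields:
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields) + "\n"


def send(message, nagios_host="localhost",
         send_nsca="/usr/sbin/send_nsca", config="/etc/send_nsca.cfg"):
    """Pipe the message to send_nsca and return (exit status, its output)."""
    result = subprocess.run(
        [send_nsca, "-H", nagios_host, "-c", config],
        input=message, text=True, capture_output=True, check=False)
    return result.returncode, result.stdout.strip()
```

Rejecting tabs and newlines inside fields keeps a stray character in the plugin output from shifting the columns that NSCA parses.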
    

    Messages are sent by munin-limits based on the state of a monitored data source: OK, Warning and Critical. Munin does not currently support an Unknown state (this will be fixed in the future; see Ticket 29 for more information).

    Configuring munin.conf

    Munin uses the above-mentioned send_nsca binary to send messages to Nagios. In /etc/munin/munin.conf, enter this:

    contacts nagios
    contact.nagios.command /usr/bin/send_nsca -H your.nagios-host.here -c /etc/send_nsca.cfg
    
      Be aware that the -H switch to send_nsca appeared sometime after send_nsca version 2.1. Always check send_nsca --help!
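
    The same invocation can be driven from a script when testing the path from a Munin host to the Nagios server. This is a sketch under assumptions: the binary and configuration paths are copied from the contact command above and may differ on your system, the helper names are hypothetical, and only the -H and -c switches shown in this document are used.

```python
import subprocess


def build_send_nsca_argv(nagios_host, config="/etc/send_nsca.cfg",
                         binary="/usr/bin/send_nsca"):
    """Return the argument vector matching the contact command shown above."""
    return [binary, "-H", nagios_host, "-c", config]


def submit(message, nagios_host):
    """Pipe one pre-formatted message to send_nsca on standard input.

    Raises subprocess.CalledProcessError if send_nsca exits non-zero and
    FileNotFoundError if the binary is not installed at the assumed path.
    """
    return subprocess.run(
        build_send_nsca_argv(nagios_host),
        input=message,
        text=True,
        capture_output=True,
        check=True,
    )
```

    Passing the arguments as a list avoids shell quoting problems with host names or paths; the message itself must already be in the tab-delimited format that NSCA expects.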

    Configuring Munin plugins

    Lots of Munin plugins have (hopefully reasonable) values for Warning and Critical levels. To set or override these, you can change the values in munin.conf.

    Configuring Nagios services

    Now Nagios needs to recognize the messages from Munin as messages about services it monitors. To accomplish this, every message Munin sends to Nagios requires a matching (passive) service defined or Nagios will ignore the message (but it will log that something tried).

    A passive service is defined through these directives in the proper Nagios configuration file:

    active_checks_enabled           0
    passive_checks_enabled          1
    

    A working solution is to create a template for passive services, like the one below:

    define service {
            name                            passive-service
            active_checks_enabled           0
            passive_checks_enabled          1
            parallelize_check               1
            notifications_enabled           1
            event_handler_enabled           1
            register                        0
            is_volatile                     1
    }

    Once the template is defined, each Munin plugin should be registered as a service that uses it, as shown below:

    define service {
            use                             passive-service
            host_name                       foo
            service_description             bar
            check_period                    24x7
            max_check_attempts              3
            normal_check_interval           3
            retry_check_interval            1
            contact_groups                  linux-admins
            notification_interval           120
            notification_period             24x7
            notification_options            w,u,c,r
            check_command                   check_dummy!0
    }
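
    Since every Munin plugin needs its own matching service definition, writing these blocks by hand quickly becomes tedious. The sketch below renders one block per host and service pair; the function name is hypothetical, and the default directive values are simply copied from the example above, so adjust them to your site.

```python
# Directive values copied from the example definition above (site-specific)
DEFAULT_DIRECTIVES = {
    "check_period": "24x7",
    "max_check_attempts": "3",
    "normal_check_interval": "3",
    "retry_check_interval": "1",
    "contact_groups": "linux-admins",
    "notification_interval": "120",
    "notification_period": "24x7",
    "notification_options": "w,u,c,r",
    "check_command": "check_dummy!0",
}


def render_passive_service(host_name, service_description,
                           template="passive-service", directives=None):
    """Render one `define service` block that inherits from the template."""
    merged = {
        "use": template,
        "host_name": host_name,
        "service_description": service_description,
    }
    merged.update(DEFAULT_DIRECTIVES if directives is None else directives)
    lines = ["define service {"]
    for key, value in merged.items():
        lines.append(f"        {key:<32}{value}")  # align values in one column
    lines.append("}")
    return "\n".join(lines) + "\n"
```

    Looping over a list of (host, plugin) pairs and concatenating the results yields a file that can be included from the Nagios configuration; the service_description must match the name Munin sends.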
    

    Notes

    • host_name is either the FQDN of the host_name registered to the Nagios plugin, or the host alias corresponding to Munin's

    NSCA_Setup

    nagios-yum-cfengine

    Books

    Nagios: System and Network Monitoring

    Best for Nagios admins who want specific details on plug-ins, September 4, 2006
    By Richard Bejtlich "TaoSecurity.com" (Washington, DC)
    I recently received review copies of Pro Nagios 2.0 (PN2) by James Turnbull and Nagios: System and Network Monitoring (NSANM) by Wolfgang Barth. I read PN2 first, then NSANM. Both are excellent books, but I expect potential readers want to know which is best for them. The following is a radical simplification, and I could honestly recommend readers buy either (or both) books. If you are completely new to Nagios and want a very well-organized introduction, I recommend PN2. If you are somewhat familiar with Nagios and want detailed descriptions of a wide variety of Nagios plug-ins, I recommend NSANM.

    NSANM's strengths lie in the depth of coverage of certain elements when compared to PN2. PN2 devotes 7 pages to host checks, while NSANM's Ch 7 offers 21 pages. PN2 supplies 8 pages on service checks, but NSANM's Ch 6 gives 46 pages. This level of detail can be very useful. For example, NSANM's explanation of check_squid also shows how to configure Squid to allow access to its cache manager.

    NSANM shares more information on certain background protocols like SNMP. PN2's SNMP section is about 7 pages, whereas NSANM's Ch 11 is 36 pages. NSANM demonstrates more aspects of Nagios' Web interface and the CGI programs generating pages. I thought author Wolfgang Barth made very effective use of diagrams, like the network topology explanation in Ch 4, the service checks in Ch 5, and notification in Ch 12.

    NSANM includes some material not mentioned in PN2, like using Nagios with Cygwin. Sometimes the books are very complementary, as shown by PN2's discussion of NSClient++ and NSANM's overview of NSClient and NC_Net.

    NSANM is lacking coverage of security, redundancy, and failover, however. PN2 does address these critical issues. Beware that some of the "chapters" in NSANM are very short -- like Ch 8 (2 pages!) and Ch 19 (barely 6 pages). I think short sections like those should have been integrated into longer chapters or moved into the appendices.

    Overall, NSANM is a very good book. I believe new Nagios readers should read PN2, and strongly consider NSANM as a complementary reference volume.

    Pro Nagios 2.0 (Expert's Voice in Open Source)

    Building a Monitoring Infrastructure with Nagios by David Josephsen

    A short, superficial intro book (190 pages). The killer phrase from the review below: "This is the book you should pass to your manager so (s)he understands why and how an open solution like Nagios is the better choice and can be used for achieving surpassing solutions."
    Warning: Several reviews of this book look like plants: they were written by reviewers who have reviewed only a single networking book, or who have just a single review in total.
    Spot on for a well structured book with many WOW-factors, May 17, 2007
    By Nils Valentin (Tokyo, Japan)
    --- DISCLAIMER: This is a requested review by PTR, however any opinions expressed within the review are my personal ones. ---

    Introduction - 6p

    CHAPTER 1 Best Practices - 12p
    CHAPTER 2 Theory of Operations - 26p
    CHAPTER 3 Installing Nagios - 11p
    CHAPTER 4 Configuring Nagios - 23p
    CHAPTER 5 Bootstrapping the Configs - 10p
    CHAPTER 6 Watching - 46p
    CHAPTER 7 Visualization - 42p
    CHAPTER 8 Nagios Event Broker Interface - 19p
    APPENDIX A Configure Options - 3p
    APPENDIX B nagios.cfg and cgi.cfg - 9p
    APPENDIX C Command-Line Options - 10p
    Index - 14p

    At 190 pages (230 including appendices and index), the book is very compact. It teaches you Nagios in a way I have never heard or read before. I must assume that the author's clearly structured style, which runs through the book like a common thread, is responsible for the excellent outcome.

    The book starts in the introduction with the title "Do it right the first time", and that hits it right on the spot. What distinguishes this little portable knowledgebase is its exceptionally well thought-out content and the author's explanations. David is not filling pages by explaining each and every parameter, but rather showing you the big picture, and explaining how to approach new issues or how one technical solution is better than another.

    This is the book you should pass to your manager so (s)he understands why and how an open solution like Nagios is the better choice and can be used for achieving surpassing solutions.

    The book itself basically is divided in two sections:

    Background, setup and configuration - Chapters 1-5
    Advanced Topics - Chapters 6-8

    I found all of the chapters to have a nice balance in the amount of information needed, but some EXCEPTIONALLY good parts of the book were:

    Chapter 1 Best practices
    Chapter 2 - the part about scheduling
    Chapters 6-8 as a whole

    Chapter 6 has thorough explanations of monitoring the different OSes (especially the Windows part !!) and other applications.

    Chapter 7 for its overall thoroughness of how to visualize your data to reach the next level of a better understanding of the systems / network you are monitoring.

    Chapter 8 is describing a filesystem based status interface. The NEB module will write a file with its current status code for each service. I have to admit that some technical details went over my head, but I thought that was pretty cool !!


    The points featured above are what I found to be exceptionally good and most likely the strongest selling points for this little portable knowledgebase. That doesn't mean that the other parts of the book not mentioned here are weak, mind you.

    Funnily enough, the above-mentioned points were EXACTLY the points which I haven't seen explained this thoroughly anywhere before.

    So David's book was exactly spot on for me.

    Summary:

    To sum it all up in very simple words: This is a hell of a book !!

    It's the most compact, well structured book on Nagios that I have seen to date. It contains many WOW-factors. While reading each chapter you can virtually "feel" how David's explanations and tips and tricks have already helped you to avoid time-consuming pitfalls.

    So this book is not about "to buy or not to buy"; this is an investment you don't want to miss !!

    I was especially impressed by the thoroughness with which the book is written from the first page. Although the content of the first chapter wasn't new to me, the way it was explained already provided many of those A-ha moments.

    The main asset of the book is not the description of the tools themselves, but rather the thought and consideration the author put into it and the sharing of those thoughts in a way that the reader can actually visualize how and why one solution is better than another, without actually having the "luxury of experiencing the pitfalls" in a live disaster scenario.


    PS: AFTER I finished reading the book I re-read the "Editorial Review" Amazon gave above and found it pretty well describing the actual book and what you should expect.




    Copyright © 1996-2009 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is placed under the copyright of the Open Content License (OPL). The site uses AdSense, so you need to be aware of the Google privacy policy. Copyright in original materials belongs to their respective owners. Quotes are made for educational purposes only, in compliance with the fair use doctrine.


    Last modified: October 18, 2009