|
Softpanorama
(slightly skeptical)
Open Source Software Educational Society |
May the
source be with you,
but remember the KISS principle ;-)
|
Nagios in Large Enterprise Environment
Nagios was formerly
known as Netsaint. It was written by Ethan Galstad approximately ten years
ago. It looks like the author had no previous experience with commercial
monitoring systems.
Functionally, Nagios is not much more than a simple daemon that schedules
probes. It is written in C. Generally it is an agentless
solution (like SiteScope), but its functionality for using SSH and telnet is
very basic. Initially it was designed to monitor the host it was running
on plus network services. In a way, Nagios is still most suitable for
network monitoring or for monitoring a limited number of servers via SSH (or
rsh or telnet).
By default the Nagios plugins
that perform various checks and measurements run on the Nagios server, not
on monitored resources (scheduled to run on localhost).
Later Nagios added the so-called Nagios Remote Plug-in Executor (NRPE),
but in terms of architecture and functionality it is not competitive with
well-engineered agents.
Plugins can generate messages, which are forwarded to the Web interface, SMTP
mail and similar services. Attempts to extend Nagios to monitoring
multiple servers are very clumsy. There is nothing like the concept of an access
path to a host or a host access method. The facilities provided (check_by_ssh
and NRPE) both smell like cheap hacks. For example, there is no way
to associate hosts or host groups with a particular access method for probes
in the Nagios configuration. For problems in using Nagios in a large enterprise
environment see
Deploying
Nagios in a Large Enterprise Environment, at USENIX LISA '07.
For a more application-oriented monitoring system see OpenSMART by
Ulrich Herbst.
Nagios imposes very few limitations on the structure and communication
of probes (plug-ins
in Nagios terminology). They can be any legitimate Unix executable
written in any language (shell, Perl, C, etc.). After execution Nagios grabs
the first line of text from STDOUT. It also captures the plug-in return code:
| Numeric Value | Service Status | Status Description |
| 0 | OK | The plugin was able to check the service and it appeared to be functioning properly |
| 1 | Warning | The plugin was able to check the service, but it appeared to be above some "warning" threshold or did not appear to be working properly |
| 2 | Critical | The plugin detected that either the service was not running or it was above some "critical" threshold |
| 3 | Unknown | Invalid command line arguments were supplied to the plugin or low-level failures internal to the plugin (such as unable to fork, or open a tcp socket) that prevent it from performing the specified operation. Higher-level errors (such as name resolution errors, socket timeouts, etc) are outside of the control of plugins and should generally NOT be reported as UNKNOWN states. |
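To make the plug-in contract concrete, here is a minimal sketch of a probe in Python. It is only an illustration: the disk-usage measurement, the 80%/90% thresholds and the message wording are assumptions chosen for the example, not part of any stock Nagios plug-in. What matters is that the probe prints a single status line and hands back one of the four return codes from the table.

```python
import shutil

# Return codes from the table above.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a measurement to (return code, one-line message).

    The thresholds are hypothetical defaults for this sketch.
    """
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"


def main(path="/"):
    """Print the single status line Nagios reads and return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # A low-level failure inside the probe itself maps to UNKNOWN.
        print(f"DISK UNKNOWN - cannot inspect {path}: {exc}")
        return UNKNOWN
    code, message = classify(100.0 * usage.used / usage.total)
    print(message)
    return code

# When saved as an executable plug-in, the file would end with:
#     import sys; sys.exit(main())
```

Keeping the threshold logic in a separate function (classify) makes it easy to check the mapping to return codes without touching a real file system.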
Parameters are passed to probes through the Nagios environment, which consists
of multiple macros. Each macro is essentially a variable that is populated
by Nagios and whose value is inserted into the probe invocation string, for
example:
define command{
command_name check_ping
command_line /usr/local/nagios/libexec/check_ping -H $HOSTADDRESS$ -w 100.0,90% -c 200.0,60%
}
Here $HOSTADDRESS$ is a
macro populated by Nagios. Actually, starting with Nagios 2.0, most macros
have been made available as environment variables.
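The substitution step can be pictured with a few lines of Python. This is a sketch of the idea only, not the code Nagios uses internally; the function name expand_macros and the choice to leave unknown macros untouched are assumptions made for the illustration.

```python
import re

# A macro is a name wrapped in dollar signs, e.g. $HOSTADDRESS$.
MACRO = re.compile(r"\$([A-Z0-9_]+)\$")


def expand_macros(command_line, macros):
    """Replace each $NAME$ with its value; unknown macros are left as-is."""
    return MACRO.sub(lambda m: macros.get(m.group(1), m.group(0)), command_line)


line = ("/usr/local/nagios/libexec/check_ping "
        "-H $HOSTADDRESS$ -w 100.0,90% -c 200.0,60%")
expanded = expand_macros(line, {"HOSTADDRESS": "192.168.1.44"})
print(expanded)
```

The address 192.168.1.44 is just a sample value; in a running system Nagios fills it in from the host definition.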
The ability to use services like SSH and telnet for communication with
remote hosts is not built into the system, and you need to specify each
such probe in the configuration file. Another possibility is to write
a custom probe envelope in some scripting language, for example Perl, that
detects which host should be reached by which method and then
runs the probe accordingly. In general, for a non-trivial enterprise network
you probably need to program your own probe envelope or hire somebody to
do this.
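A probe envelope of this kind might look roughly like the following sketch. Everything in it is hypothetical: the ACCESS_METHOD table, the host names and the plug-in path exist only for illustration, since Nagios itself has no notion of a per-host access method. The function only builds the command line and leaves execution to the caller.

```python
# Hypothetical per-host access table maintained outside Nagios.
ACCESS_METHOD = {
    "beulah": "ssh",
    "callisto": "local",
}


def build_probe_command(host, probe, args=()):
    """Return the argv list that runs `probe` for `host` via its access method."""
    method = ACCESS_METHOD.get(host, "ssh")  # assumed default for this sketch
    if method == "local":
        return [probe, *args]
    if method == "ssh":
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        return ["ssh", "-o", "BatchMode=yes", host, probe, *args]
    raise ValueError(f"unsupported access method {method!r} for {host}")


cmd = build_probe_command("beulah", "/usr/local/nagios/libexec/check_disk",
                          ["-w", "20%"])
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run, and the child's exit status and first output line passed straight back to Nagios.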
Configuration of Nagios is rather verbose and should generally be generated
with a macro processor like M4, or you will hate Nagios from day one :-).
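Any templating tool will do for this job. As a rough illustration, here is the same idea in Python instead of M4: host stanzas are rendered from a list of records. The template name normal and the sample host are borrowed from the examples later in this article; the function name render_hosts is an assumption of the sketch.

```python
# Doubled braces are literal braces in str.format templates.
HOST_STANZA = """define host{{
use {template}
host_name {name}
alias {alias}
address {address}
}}
"""


def render_hosts(hosts, template="normal"):
    """Render one 'define host' stanza per record."""
    return "\n".join(HOST_STANZA.format(template=template, **h) for h in hosts)


hosts = [
    {"name": "beulah", "alias": "beulah: SuSE 8.1", "address": "192.168.1.44"},
]
config_text = render_hosts(hosts)
print(config_text)
```

In practice the host list would come from an inventory database or a CSV file, so that the verbose stanzas never have to be edited by hand.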
The structure of events is very primitive. Structured fields are
essentially limited to the name of the host and the severity. Everything else needs
to be interpreted from the text of the message.
The only interesting architectural feature of Nagios that I have
found is the concept of adaptive monitoring.
Adaptive monitoring allows a Nagios configuration to be changed at
runtime. Currently it is limited to the ability to change
the interval between checks and the times during which checks are scheduled
to occur. This allows you to turn checks on and off at specific times according
to conditions in your environment. Not a very impressive capability, but a
step in the right direction.
Nagios has the ability to notify contacts (via email, pager or other
methods) when problems arise and are resolved. This is handled by
communication modules, some of which (for SMTP) are provided in the distribution.
A Web interface is provided. The information that Nagios collects is displayed
in a set of automatically updated Web pages. Several CGIs are included
in order to allow you to view the current and historical status via a Web
browser. A WAP interface is also provided to allow you to acknowledge problems
and disable notifications from an internet-ready cellphone. The narrow column
on the left of the display lists links to all of the possible Nagios web
pages (the one for the current page has been highlighted in the illustration).
The Tactical Overview page shows general statistics about the overall
status of the monitored infrastructure, such as the number of hosts which are down,
unreachable, etc. The display also indicates the number of services
in "critical" status (probably indicating a failure), as well as other states.
Each of the problem indicator displays also functions as a link to another
Web page giving details about that particular item.
Nagios prepackaged modules (see
Configuring Nagios
Commands) can monitor a pretty wide variety of system properties, including
standard system performance metrics such as load average and free disk space;
the presence of important services like HTTP and SMTP; and host network
availability and reachability. It also allows the system administrator to
define what constitutes a significant event on each host--for example, how
high a load average is "too high"--and what to do when such conditions are
detected.
In addition to detecting problems with hosts and their important services,
Nagios also allows the system administrator to specify what should be done
as a result. A problem can trigger an alert to be sent to a designated recipient
via various communication mechanisms (such as email, Unix message, pager).
It is also possible to define an event handler: a program that is
run when a problem is detected. Such programs can attempt to solve the problem
encountered, and they can also proactively prevent some serious problems
when they get triggered by warning conditions.
Available actions in the Nagios Host Information display
| Item | Meaning |
| Disable checks of this host | Stop monitoring this host for availability. |
| Acknowledge this host problem | Respond to a current problem (discussed below). |
| Disable notifications for this host | Don't send alerts if this host is unavailable. |
| Delay next host notification | Delay the next alert for host unavailability. |
| Schedule downtime for this host. Cancel scheduled downtime for this host | Define or cancel schedule downtime. During downtime, host unavailability is not considered a problem |
| Disable notifications for all services on this host. Enable notifications for all services on this host. | Don't/do send alerts if a service on this host fails. |
| Schedule an immediate check of all services on this host | Check all services as soon as possible (rather than waiting for their next scheduled time). |
| Disable checks of all services on this host. Enable checks of all services on this host | Disable or enable checking service health on this host. |
| Disable event handler for this host | Prevent the event handler from running when a problem is detected on this host. |
| Disable flap detection for this host | Don't try to detect flaps (rapid up-down or on-off oscillations) on this host or its services. |
The second menu item allows you to acknowledge any current problem. Acknowledging
simply means "I know about the problem, and it is being handled." Nagios
marks the corresponding event as such, and future alerts are suppressed
until the item returns to its normal state. This process also allows you
to enter a comment explaining the situation, an action that is helpful when
more than one administrator regularly examines the monitoring data.
If you don't like all of these table-oriented status displays, Nagios
also has the capability to use graphical ones. See the
Nagios Web site for
example screen shots.
Configuring Nagios
Configuring Nagios is a punishment: it is time consuming and boring.
Sometimes questions arise about the designers' inclination toward red
tape. Some of those deficiencies are connected with the fact that the system
was programmed in C. Nagios uses half a dozen configuration files.
Among them:
- nagios.cfg: This is the main Nagios configuration file, containing
global settings for the package. It defines directory locations for
the package's various components, the user and group context for the
daemon, what items to log, log file rotation settings, various time-outs
and other performance-related settings, and additional items related
to some of the package's advanced features (such as enabling event handling
and defining global event handlers).
- Object configuration files: This class of files specifies
which hosts and services are monitored. In addition, they can be used
to define host and service test commands, host groups, alerts and their
recipients, event handlers, and other object-specific settings used
by Nagios.
- cgi.cfg: This file holds settings related to the Nagios displays,
including paths to Web page items and scripts, and per-item icon and
sound selections. The file also defines allowed access to Nagios's data
and commands.
- resource.cfg: This file defines macros that may be used within
other settings for clarity and security purposes, such as to hide passwords
from view in CGI programs.
The package provides sample starter versions of all of these files. We
will consider some aspects of these file types in the remainder of this
article.
Nagios configuration files are generally stored in /usr/local/nagios/etc.
The nagios.cfg File
This configuration file contains directives that apply to the entire
Nagios monitoring system. Here is an annotated sample version illustrating
some of its most important features:
# File locations
log_file=/var/log/nagios.log
cfg_file=/etc/opt/nagios/checkcommands.cfg
cfg_file=/etc/opt/nagios/misccommands.cfg
cfg_file=/etc/opt/nagios/hosts.cfg
resource_file=/etc/opt/nagios/resource.cfg
lock_file=/var/run/nagios.lock
...
The first part of the configuration file specifies various file locations,
including the general log file and the files holding service check command and
notification and event handler command definitions (checkcommands
and misccommands). Other cfg_file directives are
used by the administrator to specify the object definition files in use
at that site (hosts.cfg in this example). Locations for other types of
files follow. The lock file holds the PID of the current nagios process.
# Logging settings
log_rotation_method=d
log_archive_path=/var/log/nagios
use_syslog=1
log_host_retries=1
log_event_handlers=1
...
These directives specify logging settings, including how often logs are
rotated (here, daily), the archive directory for old files, whether to log
significant problems to syslog as well, and whether to log individual event
types.
# Global settings
nagios_user=nagios
nagios_group=nagios
date_format=us
admin_email=nagadmin
admin_pager=19995551212
These lines specify various global settings, including the user/group
as which the nagios daemon runs, the output format for dates (here, US style),
and the administrator's email address. The final item sets the value of
the $ADMINPAGER$ macro, which can be used in command definitions.
# Package-wide event handlers
enable_event_handlers=1
global_host_event_handler=global-event-command
global_service_event_handler=global-svc-command
Settings related to event handlers. You can optionally define a single
event handler for all host failures and service failures in this file if
appropriate. Commands are defined in an object configuration file.
# Concurrent checks and time-outs
max_concurrent_checks=0
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
...
These directives control the maximum number of checks that can be run
at the same time (0 means an unlimited number), as well as time-outs for
various types of commands (values in seconds).
# Retained status information
retain_state_information=1
retention_update_interval=60
use_retained_program_state=1
These lines tell Nagios to retain information about host and service
status between sessions, saving the values every 60 seconds, and reloading
them when the facility starts up.
# Passive service checks
accept_passive_service_checks=1
check_service_freshness=1
These directives enable "passive checks": status data produced by external
commands which Nagios imports periodically.
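One common way for an external program to hand such a result to Nagios is the PROCESS_SERVICE_CHECK_RESULT external command, written as a single line to the external command file. The sketch below only formats that line; the host name, service description and timestamp are placeholder values, and the function name is an assumption of this example.

```python
def passive_service_result(host, service, return_code, output, timestamp):
    """Format one passive service check result line."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1, 2 or 3 (UNKNOWN)")
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")


line = passive_service_result("beulah", "Check SMTP", 0,
                              "SMTP OK - reported by external job",
                              timestamp=1206096000)
print(line, end="")
```

The service must already be defined in the object configuration, with passive checks accepted, for Nagios to use the submitted result.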
# Save Nagios data for later use
process_performance_data=1
host_perfdata_command=process-host-perfdata
service_perfdata_command=process-service-perfdata
These directives allow you to save Nagios data externally for long term
analysis or other purposes. The commands specified here must be defined
in some object configuration file. The simplest such command simply writes
the command's output to an external file: e.g., echo $OUTPUT$ >>
file, but you can perform whatever action is appropriate (e.g., send the
data to an RRDTool or other database).
Note that the directives appear in a slightly different order in the
sample nagios.cfg file provided with the package.
Object Configuration Files
The bulk of Nagios configuration occurs in the object configuration files.
These files define hosts and services to be monitored, how various status
conditions should be interpreted, and what actions should be taken when
they occur. These files are used to define the following items:
- Hosts: Computers and other network
devices
- Host Groups: Named groups of
hosts
- Services: Important daemons
providing specific network services
- Contacts: Users to be contacted
in the event of a problem
- Contact Groups: Named groups
of contacts
- Time Periods: Day and/or time
ranges within a week, used to specify when checks are to be performed,
notifications are to be sent, and the like
- Commands: Commands to be run
for all purposes (host/service checking, notifications, event handling,
and so on). Nagios provides two files containing many predefined commands:
checkcommands.cfg and misccommands.cfg.
- Host Dependencies: Specifications of host reachability dependencies.
When an intermediate host is down, checks are skipped for all hosts
that are dependent on that one.
- Service Dependencies: Specifications of service dependency
requirements. When a service host is down, checks are skipped for all
other services that are dependent on it.
- Host Escalations: Definitions of optional escalation levels
for host problems
- Host Group Escalations: Definitions of optional escalation
levels for host groups
- Service Escalations: Definitions of optional escalation levels
for failed services
The first seven items (hosts through commands) will need to be defined for
virtually every Nagios installation;
the remaining ones are optional. In the sample Nagios configuration provided
with the package, each type of object is defined in a separate configuration
file (named after the object type, excluding any spaces). However, you can
arrange your definitions in any form that makes sense to you.
Hosts and Host Groups
All of these items are defined via templates: named sets of attributes
and settings that can be easily applied to any number of actual objects.
For example, here is a template definition for hosts:
define host{
; Template name
name normal
; This is only a template (not a real host)
register 0
; Host notifications are enabled
notifications_enabled 1
; Command to check if host is available
check_command check-host-alive
; Recheck failures this many times
max_check_attempts 1
; Repeat failure notifications every 2 hours
notification_interval 120
; When to send notifications (time period name)
notification_period 24x7
; Notify when down, unreachable and on recovery
notification_options d,u,r
; Host event handler is enabled
event_handler_enabled 1
; Event handler command (defined elsewhere)
event_handler host-eh
; Flap detection is disabled
flap_detection_enabled 0
; Save performance data
process_perf_data 1
; Save status information across restarts
retain_status_information 1
}
This template defines a variety of host-monitoring settings (which are
explained in the comments following the semicolons). Here is a host definition
that uses this template:
define host{
; Template on which to base host
use normal
; Note the attribute is not "name" as above
host_name beulah
; Longer description
alias beulah: SuSE 8.1
; IP address
address 192.168.1.44
; Overrides template value
max_check_attempts 8
}
Other hosts may be defined in a similar way. Host definitions themselves
can also be used as templates, provided that a name attribute
is included.
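The effect of the use attribute can be sketched as a recursive dictionary merge: values from the template come first and the object's own attributes override them. This is a simplified model, not Nagios's actual parser; in particular, treating name and register as attributes that are not passed down is an assumption of the sketch.

```python
# Attributes that describe the template itself rather than the objects
# built from it (assumed not to be inherited in this simplified model).
NOT_INHERITED = {"name", "register", "use"}


def resolve(definition, templates):
    """Merge a definition with the chain of templates named by 'use'."""
    merged = {}
    parent = definition.get("use")
    if parent is not None:
        inherited = resolve(templates[parent], templates)
        merged.update({k: v for k, v in inherited.items()
                       if k not in NOT_INHERITED})
    merged.update({k: v for k, v in definition.items() if k != "use"})
    return merged


templates = {
    "normal": {"name": "normal", "register": "0",
               "max_check_attempts": "1", "notification_interval": "120"},
}
beulah = resolve({"use": "normal", "host_name": "beulah",
                  "max_check_attempts": "8"}, templates)
print(beulah)
```

Here beulah keeps the template's notification_interval but ends up with max_check_attempts set to 8, matching the "Overrides template value" comment in the host definition.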
Once hosts have been defined, they may be placed into host groups via
directives like this one:
define hostgroup{
hostgroup_name bldg2
alias Building 2
contact_groups admins1
members beulah,callisto,ariadne,leah,lovelace,valley
}
This definition creates the host group named bldg2, consisting
of six hosts (all previously defined via define host directives). The
contact_groups attribute specifies who to send notifications
to, and it is defined elsewhere (as we'll see).
You can use as many host groups as you want to. Hosts can be part of
multiple host groups, and host groups themselves may be nested.
Services
Here are two service templates and a service definition:
define service{ ; Define defaults for all services
name generic
register 0
; Check service every 30 minutes
normal_check_interval 30
; Retry failing checks every 3 minutes, up to 5 times
retry_check_interval 3
max_check_attempts 5
event_handler_enabled 1
check_period 24x7
; Repeat notifications for failures every 2 hours
notification_interval 120
notification_period 6to22
; Notify contacts about critical failures/recoveries
notification_options c,r
notifications_enabled 1
contact_groups admins
}
define service{ ; Define the SMTP service
use generic
name generic-smtp
register 0
service_description Check SMTP
check_command check_smtp
event_handler eh_smtp
contact_groups mailadmins
}
define service{ ; Define services to be monitored
use generic-smtp
; Monitor SMTP for all hosts in this host group
host_groups mailhosts
}
The first template (generic) defines some settings, which can be applied
to a variety of service types. The second template, generic-smtp, uses the
first template as a starting point and adds to it in order to create a
generic SMTP monitoring service. Specifically, it defines a check command,
an event handler, and a contact group that are appropriate for the SMTP
service. The final define service stanza sets up SMTP monitoring for all
of the hosts in the mailhosts host group.
Contacts and Contact Groups
Here are two stanzas defining a contact and a contact group:
define contact{
contact_name nagadmin
alias Nagios Admin
; When to notify about service problems
service_notification_period 6to22
; When to notify about host problems
host_notification_period 24x7
; Notify on critical problems and recoveries
service_notification_options c,r
; Notify on host down and recoveries
host_notification_options d,r
service_notification_commands notify-by-email
host_notification_commands host-notify-by-epager
email nagios-admins@ahania.com
pager $ADMINPAGER$
}
define contactgroup{
contactgroup_name mailadmins
alias Mail Admins
members mailadm,chavez,catfemme
}
The first stanza defines a contact named nagadmin. It also
defines what events to notify this contact about and the time periods during
which notifications should be sent. The commands to use to generate the
alerts are also specified, along with arguments to them (see below).
Time Periods
Time period definitions are quite simple. Here are the definitions of
the two time periods we have used so far:
define timeperiod{
timeperiod_name 24x7
alias 24 Hours A Day, 7 Days A Week
sunday 00:00-24:00
monday 00:00-24:00
tuesday 00:00-24:00
wednesday 00:00-24:00
thursday 00:00-24:00
friday 00:00-24:00
saturday 00:00-24:00
}
define timeperiod{
timeperiod_name 6to22
alias Weekdays, 6 AM to 10 PM
monday 06:00-22:00
tuesday 06:00-22:00
wednesday 06:00-22:00
thursday 06:00-22:00
friday 06:00-22:00
}
Note that only the applicable days need be included in the definition.
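How such a definition is evaluated can be shown with a short Python sketch. It models a time period as a dictionary of day names to HH:MM-HH:MM ranges and treats missing days as "never", which mirrors the note above. The function names and the half-open interval (start inclusive, end exclusive) are assumptions of the sketch, not a description of the Nagios source.

```python
from datetime import datetime

# datetime.weekday() returns 0 for Monday.
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight ('24:00' gives 1440)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def in_timeperiod(period, when):
    """Return True if `when` falls inside one of the period's ranges."""
    ranges = period.get(WEEKDAYS[when.weekday()])
    if ranges is None:
        return False  # day not listed in the definition
    now = when.hour * 60 + when.minute
    for rng in ranges.split(","):
        start, end = rng.split("-")
        if to_minutes(start) <= now < to_minutes(end):
            return True
    return False


SIX_TO_22 = {day: "06:00-22:00" for day in WEEKDAYS[:5]}
print(in_timeperiod(SIX_TO_22, datetime(2024, 1, 1, 7, 0)))
```

The sample date, 1 January 2024, is a Monday, so the call checks a weekday morning against the 6to22 period.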
Commands
The commands referred to in many of the preceding object definitions
also must be defined. For example, here is the SMTP service check command
definition:
define command{
command_name check_smtp
command_line $USER1$/check_smtp -H $HOSTADDRESS$
}
This command runs the check_smtp script stored in the directory
defined in the macro $USER1$ (defined in the resource.cfg file--see
below); this macro conventionally holds the path to the Nagios plug-ins
directory. The command is passed the option -H, followed by
the IP address of the host to be checked (the latter is expanded from the
built-in $HOSTADDRESS$ macro).
You can determine the syntax for any plug-in by running it with the
--help option. You can also extend Nagios by adding custom
plug-ins of your own. See the documentation for details on how to accomplish
this.
Event handlers are defined in the same way, as in this example:
define command{
command_name eh_smtp
command_line /usr/local/nagios/eh/fix_mail $HOSTADDRESS$ $STATETYPE$
}
Here, we define the command named eh_smtp. It specifies
the full path to a program to run, passing two arguments: the host's IP
address and the value of the $STATETYPE$ macro. This item is
set to SOFT while Nagios is still retrying a failed check and to HARD once
the configured number of check attempts (max_check_attempts) has been exhausted.
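The logic inside a handler such as fix_mail is up to the administrator. The sketch below shows one plausible policy, acting only when $STATETYPE$ is HARD; both that policy and the restart_mail.sh path are hypothetical placeholders, since the article does not show the handler's contents.

```python
def plan_action(host_address, state_type):
    """Decide what the handler should run, or None to do nothing.

    Receives the two arguments from the eh_smtp command line:
    $HOSTADDRESS$ and $STATETYPE$.
    """
    if state_type.upper() != "HARD":
        return None  # assumed policy: ignore SOFT states
    # Hypothetical repair script on the affected host.
    return ["ssh", host_address, "/usr/local/nagios/eh/restart_mail.sh"]


print(plan_action("192.168.1.44", "SOFT"))
print(plan_action("192.168.1.44", "HARD"))
```

Returning the command instead of running it keeps the decision testable; a real handler would pass the list to subprocess.run.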
Here are the definitions of commands used for notifications (we've wrapped
the command_line setting for clarity):
define command{
command_name notify-by-email
command_line /usr/bin/printf "%b" "***** Nagios 1.0 *****\n\n
Notification Type: $NOTIFICATIONTYPE$\n\n
Service: $SERVICEDESC$\n
Host: $HOSTALIAS$\n
Address: $HOSTADDRESS$\n
State: $SERVICESTATE$\n\n
Date/Time: $DATETIME$\n\n
Additional Info:\n\n$OUTPUT$" |
/usr/bin/mail -s "** $NOTIFICATIONTYPE$
alert - $HOSTALIAS$/$SERVICEDESC$
is $SERVICESTATE$ **" $CONTACTEMAIL$
}
This command constructs a simple email message using the printf
command and many built-in Nagios macros. It then sends the message using
the mail command, specifying the recipient as the $CONTACTEMAIL$
macro. The latter contains the value of the email
attribute of the contact being notified.
The cgi.cfg File
The cgi.cfg configuration file has several different functions within the
Nagios system. Among the most important is authentication, allowing Nagios
and its data to be restricted to appropriate people. Here are some sample
directives related to authorization:
use_authentication=1
authorized_for_configuration_information=netsaintadmin,root,chavez
authorized_for_all_services=netsaintadmin,root,chavez,maresca
The first entry enables the access control mechanism. The next two entries
specify users who are allowed to view Nagios configuration information and
services status information (respectively). Note that all users also must
be authenticated to the Web server using the usual Apache htpasswd
mechanism.
This same configuration file is also used to store settings for icon-based
status displays, as in these examples:
hostextinfo[janine]=;redhat.gif;;redhat.gd2;;168,36;,,;
hostextinfo[ishtar]=;apple.gif;;apple.gd2;;125,36;,,;
These entries specify extended attributes for the hosts defined in the
entries labeled janine and ishtar. The filenames in this example specify
image files for the host in status tables (GIF format--see Figure 3) and
in the status map (GD2 format), and the two numeric values specify the device's
location--for example, x and y coordinates--within the 2D status map. (Figure
4 provides an example status map display).
The resource.cfg File
The final configuration file we will consider is the resource.cfg file.
It is used to define site-specific macros, strangely named $USER1$
through $USER32$:
# $USER1$ = path to plugins directory
$USER1$=/usr/lib/nagiosplugins
...
# Store a username and password (hidden)
$USER3$=admin
$USER4$=somepassword
The first macro defines the path to the Nagios plug-ins directory; this
usage is assumed by the supplied sample configuration files.
The other two macros are used in this case to store a username and password.
These items can be used in command definitions for added security. The resource.cfg
file itself can be protected against all non-root access without compromising
the ability of CGI programs to run successfully.
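If you generate or audit configurations with scripts, the resource file is easy to read. The following Python sketch parses lines of the form shown above into a dictionary; it is an illustration only, and the rule of silently skipping anything that is not a $NAME$=value line is an assumption of the sketch.

```python
def parse_resource_cfg(text):
    """Return {macro name: value} for lines like '$USER1$=/some/path'."""
    macros = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        name, sep, value = line.partition("=")
        name = name.strip()
        if not sep or not (name.startswith("$") and name.endswith("$")):
            continue  # not a macro definition
        macros[name.strip("$")] = value.strip()
    return macros


sample = """# $USER1$ = path to plugins directory
$USER1$=/usr/lib/nagiosplugins
# Store a username and password (hidden)
$USER3$=admin
$USER4$=somepassword
"""
resources = parse_resource_cfg(sample)
print(resources["USER1"])
```

Because the file may hold passwords, any script that reads it should run with the same restricted permissions as the file itself.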
Checking a Nagios Configuration
Since Nagios configuration is somewhat involved, the package provides
a command that can be used to verify it prior to running the program. Here
is an example of its use:
# cd /etc/opt/nagios/nagios/etc
# /usr/local/nagios/bin/nagios -v nagios.cfg
This will check the Nagios configuration, which uses nagios.cfg as its
main configuration file.
External Command Pipe
November 20, 2008 | Linux.com
System monitoring tool Nagios offers
a powerful mechanism for receiving events and commands from external
applications. External commands are usually sent from event handlers
or from the Nagios Web interface. You will find external commands most
useful when writing event handlers for your system, or when writing
an external application that interacts with Nagios.
This article is excerpted from the newly published book
Learning Nagios 3.0 from
Packt
Publishing.
The external commands pipe is a pipe file created on a filesystem
that Nagios uses to receive incoming messages. The communication does
not use any authentication or authorization -- the only requirement
is to have write access to the pipe file, rw/nagios.cmd, which is located
in the directory passed as the localstatedir option during compilation.
An external command file is usually writable by the owner and the
group; the usual group used is nagioscmd. If you want a user to be able
to send commands to the Nagios daemon, simply add that user to this
group.
A small limitation of the command pipe is that there is no way to
get any results back, so it is not possible to send any query commands
to Nagios. Therefore, by just using the command pipe, you have no verification
that the command you have passed to Nagios has been processed, or will
be processed soon. It is, however, possible to read the Nagios log file
and check whether it indicates that the command has been parsed correctly.
The Nagios Web interface uses an external command pipe to control
how Nagios works. The Web interface does not use any other means to
send commands or apply changes to Nagios.
From the Nagios daemon perspective, there is no clear distinction
as to who can perform what operations. Therefore, if you plan to use
the external command pipe to allow users to submit commands remotely,
you need to make sure that authorization is in place so that unauthorized
users cannot send potentially dangerous commands to Nagios.
The syntax for formatting commands is easy. Each command must be
placed on a single line and end with a newline character. The syntax
is as follows:
[TIMESTAMP] COMMAND_NAME;argument1;argument2;...;argumentN
TIMESTAMP is written as Unix time -- that is, the number of seconds
since 1970-01-01 00:00:00. You can create this by using the date command.
Most programming languages also offer the means to get the current Unix
time.
Commands are written in upper case. The arguments depend on the actual
command. For example, to add a comment to a host stating that it has
passed a security audit, you can use the following shell command:
echo "[$(date +%s)] ADD_HOST_COMMENT;somehost;1;Security Audit;This host has passed security audit on $(date +%Y-%m-%d)" >/var/nagios/rw/nagios.cmd
This will send an
ADD_HOST_COMMENT command to Nagios over the external command pipe.
Nagios will then add a comment to the host, somehost, stating that the
comment originated from Security Audit. The first argument specifies
the host name to add the comment to; the second tells Nagios if this
comment should be persistent. The next argument describes the author
of the comment, and the last argument specifies the actual comment text.
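The same line format is easy to produce from a program. Here is a small Python sketch; the function names format_command and send_command are made up for this example, and the default pipe path simply repeats the one used in the shell command above, which may differ on your installation.

```python
import time


def format_command(name, *args, timestamp=None):
    """Build '[TIMESTAMP] COMMAND_NAME;arg1;...;argN' ending with a newline."""
    if timestamp is None:
        timestamp = int(time.time())  # current Unix time
    fields = [name.upper(), *(str(a) for a in args)]
    return f"[{timestamp}] " + ";".join(fields) + "\n"


def send_command(line, pipe_path="/var/nagios/rw/nagios.cmd"):
    """Write one command line to the external command pipe."""
    with open(pipe_path, "w") as pipe:
        pipe.write(line)


comment = format_command("ADD_HOST_COMMENT", "somehost", 1, "Security Audit",
                         "This host has passed security audit",
                         timestamp=1206096000)
print(comment, end="")
```

Note that an argument containing a semicolon would be split into two fields by this format, so the sketch assumes argument values never contain one.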
Similarly, adding a comment to a service requires the use of the
ADD_SVC_COMMENT command. The command's syntax is similar to that
of the ADD_HOST_COMMENT command except that the command requires the
specification of the host name and service name.
You can also delete a single comment or all comments using the
DEL_HOST_COMMENT,
DEL_ALL_HOST_COMMENTS, and
DEL_SVC_COMMENT or
DEL_ALL_SVC_COMMENTS commands.
Other commands worth mentioning are related to scheduling checks
on demand. Often, it is necessary to request that a check be carried
out as soon as possible; for example, when testing a solution.
You can create a script that schedules a check of a host, all services
on that host, and a service on a different host, as follows:
#!/bin/sh
NOW=$(date +%s)
echo "[$NOW] SCHEDULE_HOST_CHECK;somehost;$NOW" \
  >/var/nagios/rw/nagios.cmd
echo "[$NOW] SCHEDULE_HOST_SVC_CHECKS;somehost;$NOW" \
  >/var/nagios/rw/nagios.cmd
echo "[$NOW] SCHEDULE_SVC_CHECK;otherhost;Service Name;$NOW" \
  >/var/nagios/rw/nagios.cmd
exit 0
The commands
SCHEDULE_HOST_CHECK and
SCHEDULE_HOST_SVC_CHECKS accept a host name and the time at which
the check should be scheduled. The
SCHEDULE_SVC_CHECK command requires the specification of a service
description as well as the name of the host to schedule the check on.
Normal scheduled checks, such as the ones scheduled above, might
not actually take place at the time that you scheduled them. Nagios
also needs to take allowed time periods into account as well as checking
whether checks were disabled for a particular object or globally for
the entire Nagios.
There are cases when you'll need to force Nagios to do a check --
in such cases, you should use the
SCHEDULE_FORCED_HOST_CHECK,
SCHEDULE_FORCED_HOST_SVC_CHECKS, and
SCHEDULE_FORCED_SVC_CHECK commands. They work in exactly the same
way as described above, but make Nagios skip the checking of time periods
and ignore whether checks are disabled for this particular object.
This way, a check will always be performed, regardless of other Nagios
parameters.
Other commands worth using are related to custom variables, introduced
in Nagios 3. When you define a custom variable for a host, service,
or contact, you can change its value on the fly with the external command
pipe.
As these variables can then be directly used by check or notification
commands and event handlers, it is possible to make other applications
or event handlers change these attributes directly without modifications
to the configuration files.
How might this work? Suppose that the IT staff registers its presence
via an application without any GUI. This application periodically sends
information about the latest known IP address, and that information
is then passed to Nagios assuming that the person is in the office.
This would later be sent to a notification command to use that specific
IP address while sending a message to the user.
Assuming that the user name is jdoe and the custom variable name
is DESKTOPIP, the message that would be sent to the Nagios external
command pipe would be as follows:
[1206096000] CHANGE_CUSTOM_CONTACT_VAR;jdoe;DESKTOPIP;12.34.56.78
This would cause a subsequent use of $_CONTACTDESKTOPIP$ to return
a value of 12.34.56.78.
Nagios offers the
CHANGE_CUSTOM_CONTACT_VAR,
CHANGE_CUSTOM_HOST_VAR, and
CHANGE_CUSTOM_SVC_VAR commands for modifying custom variables in
contacts, hosts, and services.
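The relation between the command and the macro it feeds can be sketched in a few lines of Python. Both helper names are hypothetical. The service variant is assumed to take the host name and the service description as two separate fields, and macro names are assumed to be the upper-cased variable name prefixed with $_HOST, $_SERVICE or $_CONTACT.

```python
# Command names from the text above; keys are labels for this sketch.
CHANGE_COMMANDS = {
    "contact": "CHANGE_CUSTOM_CONTACT_VAR",
    "host": "CHANGE_CUSTOM_HOST_VAR",
    "service": "CHANGE_CUSTOM_SVC_VAR",
}
MACRO_PREFIX = {"contact": "CONTACT", "host": "HOST", "service": "SERVICE"}


def change_custom_var_line(kind, target, varname, value, timestamp):
    """Build the external-command line that changes one custom variable.

    `target` is a tuple identifying the object: (contact,), (host,) or
    (host, service_description).
    """
    fields = [CHANGE_COMMANDS[kind], *target, varname, value]
    return "[%d] %s\n" % (timestamp, ";".join(fields))


def custom_macro(kind, varname):
    """Name of the macro through which commands read the variable."""
    return "$_%s%s$" % (MACRO_PREFIX[kind], varname.upper())
```

With the values from the example, change_custom_var_line("contact", ("jdoe",), "DESKTOPIP", "12.34.56.78", 1206096000) reproduces the line shown earlier, and custom_macro("contact", "DESKTOPIP") gives $_CONTACTDESKTOPIP$.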
The commands explained above are just a small subset of the full
capabilities of the Nagios external command pipe. For a complete list
of commands, visit
the External Command List.
January 20th, 2009
If you are using Nagios to monitor
remote servers, you have more than one method to execute checks, including
the use of the check_by_ssh plugin. Vincent Danen tells you how to set
up this plugin and the best way to secure it.
Nagios is a monitoring system that can be used to monitor a wide
variety of services and criteria. Remotely, it can monitor anything
that can be accessed remotely: Web sites, SMTP servers, FTP servers,
and so forth. Locally, it can monitor even more: load average, swap
and memory usage, disk space usage, hard drive temperatures, and the
like. In fact, Nagios’ extensible nature makes writing plugins a breeze,
so it is possible to monitor anything for which you are able to get
representable data.
Unfortunately, if you wish to monitor local resource usage on a remote
site it can be a little trickier. There are a number of ways this can
be done, from using NSCA (Nagios Service Check Acceptor) to using NRPE
(Nagios Remote Plugin Executor). These solutions may be best if you
are able to compile and install software on the other machine, but if
that is not a possibility, there are other solutions.
One such solution is to execute checks via SSH. If you are able to
access the remote machine via SSH and have the ability to run programs
out of a home directory, and the ability to set an SSH public key, then
the check_by_ssh plugin is perhaps your best bet.
The first step is to ensure that the central Nagios server is able
to connect to the remote host via SSH in a manner that does not require
a password. This would require creating a password-less public/private
keypair as the user running the Nagios service (typically “nagios”),
sending the public key to the remote server, and then (as user “nagios”)
logging into the remote system. For example:
nagios@nagiosserver:~/ > $ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/nagios/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/nagios/.ssh/id_dsa.
Your public key has been saved in /home/nagios/.ssh/id_dsa.pub.
The key fingerprint is:
6a:b4:cb:f1:7d:7b:7c:1b:c4:79:2a:5d:5a:16:da:b8 nagios@nagiosserver.com
nagios@nagiosserver:~/ > $ scp .ssh/id_dsa.pub user@remotehost.com:~/.ssh/authorized_keys
nagios@nagiosserver:~/ > $ ssh user@remotehost.com
user@remotehost:~/ > $
This creates the key without a passphrase and then copies the newly-created
id_dsa.pub public key file to the remote host. Note that the scp command
shown replaces any authorized_keys file already present there; append the
key instead if that account already accepts other keys. Make sure that the
~user/.ssh directory already exists on the remote host and ensure that it
is mode 0700 to protect it. If that is all correct, then using ssh to connect
to the remote site as the specified user should yield a shell prompt.
If so, then we can configure Nagios to use check_by_ssh.
One quick note: if you are able to create a dedicated account on
the remote system for this, it would be best to do so. If, on the other
hand, you are unable to, be sure to adequately protect your central
Nagios server, because if anyone can obtain privileges as “nagios” on
the central server, they will have an easy ticket to your user account
on the remote server.
The next step is to copy whichever plugins you wish to execute on the remote
machine into a ~/bin or ~/plugins directory there.
To step up security, you can write a wrapper script to execute those
specific commands and modify ~/.ssh/authorized_keys on the remote server
to only execute the wrapper script, which would prevent that key from
being used for anything other than executing Nagios checks.
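One possible shape for such a wrapper is sketched below in Python; it is not taken from the tip itself. It relies on the SSH_ORIGINAL_COMMAND environment variable that OpenSSH sets when a forced command is configured for a key. The plugin directory matches the path used in this tip, while the allowlist and the helper name resolve are assumptions for illustration.

```python
#!/usr/bin/env python3
import os
import shlex
import sys

PLUGIN_DIR = "/home/user/bin"            # path used in this tip; adjust as needed
ALLOWED = {"check_load", "check_disk"}   # hypothetical allowlist


def resolve(original_command, plugin_dir=PLUGIN_DIR, allowed=ALLOWED):
    """Return the argv list for a permitted plugin call, or None."""
    try:
        argv = shlex.split(original_command or "")
    except ValueError:                   # unbalanced quotes and similar
        return None
    if not argv:
        return None
    name = os.path.basename(argv[0])
    if name not in allowed:
        return None
    if os.path.dirname(argv[0]) not in ("", plugin_dir):
        return None
    return [os.path.join(plugin_dir, name)] + argv[1:]


def main():
    argv = resolve(os.environ.get("SSH_ORIGINAL_COMMAND"))
    if argv is None:
        print("UNKNOWN: command not permitted")
        return 3
    # No shell is involved; the plugin replaces this process, so its
    # exit status is what check_by_ssh reports back to Nagios.
    os.execv(argv[0], argv)


# Act only when sshd invoked the script as a forced command.
if __name__ == "__main__" and "SSH_ORIGINAL_COMMAND" in os.environ:
    sys.exit(main())
```

To use it, the key's entry in ~/.ssh/authorized_keys on the remote server would be prefixed with something like command="/home/user/bin/nagios_wrapper.py",no-port-forwarding,no-pty (the script name is hypothetical).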
On the central Nagios server, in the commands.cfg configuration file,
define the new checks. The example below defines a new check_ssh_load
command:
# 'check_ssh_load' command definition
define command {
command_name check_ssh_load
command_line $USER1$/check_by_ssh -H $HOSTADDRESS$ -C "/home/user/bin/check_load -w $ARG1$ -c $ARG2$"
}
This command will call the check_by_ssh plugin to connect to the
specified host (via the $HOSTADDRESS$ macro) and execute the command
/home/user/bin/check_load, which is the check_load plugin, on the remote
machine; you will need to adjust the path to match the location of that
plugin on the remote server. As well, if paths and/or usernames differ
on remote servers and you plan to monitor more than one, you may need
to define multiple commands, one for each server (or use macros).
Next, edit services.cfg and add the following:
define service {
use local-service ; check current load on machine
hostgroup_name ssh-nagios-services
service_description Current Load
check_command check_ssh_load!5.0,4.0,3.0!10.0,6.0,4.0
}
This defines a new service to execute for hosts in the ssh-nagios-services
hostgroup. It calls the defined check_ssh_load command and will put
the service in a warning state if the 1-, 5-, or 15-minute load average
exceeds 5, 4, or 3 respectively, and in a critical state if it exceeds
10, 6, or 4 (adjust to suit, of course).
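To make the link between the check_command value and the command definition concrete, here is a small Python sketch of the substitution Nagios performs: the parts after each ! become $ARG1$, $ARG2$ and so on. The function name and the sample macro values are assumptions; real Nagios macro processing handles more cases than this.

```python
def expand_check_command(check_command, command_line, macros):
    """Split 'name!arg1!arg2' and substitute $ARGn$ and other macros."""
    name, *args = check_command.split("!")
    values = dict(macros)
    for index, arg in enumerate(args, start=1):
        values["ARG%d" % index] = arg
    expanded = command_line
    for key, value in values.items():
        expanded = expanded.replace("$%s$" % key, value)
    return name, expanded


# Command line from the commands.cfg example above.
COMMAND_LINE = ('$USER1$/check_by_ssh -H $HOSTADDRESS$ '
                '-C "/home/user/bin/check_load -w $ARG1$ -c $ARG2$"')

name, line = expand_check_command(
    "check_ssh_load!5.0,4.0,3.0!10.0,6.0,4.0",
    COMMAND_LINE,
    # Sample values; $USER1$ normally comes from the resource file.
    {"USER1": "/usr/local/nagios/libexec", "HOSTADDRESS": "192.0.2.10"},
)
print(name)
print(line)
```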
Finally, edit hostgroups.cfg to create the ssh-nagios-services hostgroup.
Systems added to this hostgroup will automatically begin to use the
defined service.
define hostgroup {
hostgroup_name ssh-nagios-services
alias Nagios over SSH
members remote1,remote2
}
Here we define that remote1 and remote2 both belong to this hostgroup.
As a result, both will start using the check_ssh_load command.
Using check_by_ssh is a convenient and secure way to execute Nagios
plugins on remote servers. When all you can see of the status of a remote
server is HTTP or SMTP availability, your view of the server is quite
restricted. Being able to see local resource usage can allow you to
spot problems, and correct them, before they are visible to users.
Expect to the rescue! Expect is a nifty program that automates tasks
based on input and output. You should have installed Expect on your
computer by now. Using heavily Googled code, I concocted my own version
of the sshexpect script:
#!/usr/bin/expect -f
#Expect script to supply root/admin password for remote ssh server
#and execute command.
#This script needs three arguments to connect to the remote server:
#ipaddr = IP address of remote UNIX server, no hostname
#password = Password of remote UNIX server, for root user.
#scriptname = Path to remote script which will execute on remote server
#For example:
#./sshlogin.exp 192.168.1.11 password who
#------------------------------------------------------------------------
#Copyright (c) 2004 nixCraft project <http://cyberciti.biz/fb/>
#This script is licensed under GNU GPL version 2.0 or above
#-------------------------------------------------------------------------
#This script is part of nixCraft shell script collection (NSSC)
#Visit http://bash.cyberciti.biz/ for more information.
#----------------------------------------------------------------------
#set Variables
set ipaddr [lrange $argv 0 0]
set password [lrange $argv 1 1]
set scriptname [lrange $argv 2 2]
set arg1 [lrange $argv 3 3]
set arg2 [lrange $argv 4 4]
set arg3 [lrange $argv 5 5]
set arg4 [lrange $argv 6 6]
set arg5 [lrange $argv 7 7]
set arg6 [lrange $argv 8 8]
set arg7 [lrange $argv 9 9]
#setting a timeout for the password prompt 5 seconds larger than the SSH ConnectionTimeout parameter
set timeout 35
#now connect to remote UNIX box (ipaddr) with given script to execute
set pid [spawn -noecho ssh -o "ConnectTimeout 30" -o "CheckHostIP no" -o "StrictHostKeyChecking no" $ipaddr $scriptname $arg1 $arg2 $arg3 $arg4 $arg5 $arg6 $arg7]
match_max 5000
#look for password prompt
log_user 0
expect {
"denied" {puts "CRITICAL: wrong SSH password" ; exit 2}
"Name or service not known" {puts "CRITICAL: cannot resolve SSH server name $ipaddr" ; exit 2}
"Connection refused" {puts "CRITICAL: SSH connection to $ipaddr refused" ; exit 2}
"Connection timed out" {puts "CRITICAL: SSH connection to $ipaddr timed out" ; exit 2}
timeout {puts "CRITICAL: SSH server timed out while prompting for password" ; exit 2}
"?assword:"
}
# send password
send -- "$password\r"
# send blank line to make sure we get back to gui
send -- "\r"
expect "\r"
log_user 1
# now we wait up to 30 seconds
set timeout 30
expect {
timeout {puts "CRITICAL: execution of $scriptname timed out after 30 seconds" ; exit 2}
eof
}
set waitret [wait]
catch {close}
set state [lindex $waitret 2]
exit [lindex $waitret 3]
What this script does is fairly easy to understand (once it's been
explained to you!). It starts ssh with the passed arguments
(a maximum of 8) against the server you specify, with the password
you specify. It returns the exit status of the remotely
invoked command.
The script suppresses any SSH output not related to the command,
so beware: if the password is wrong, you will not be told. The script
also makes SSH skip host key verification, so if you're finicky
about security, perhaps this is the wrong approach for you. But it works
for me, so let's go on. Again, keep reading.
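The status handling that the Expect script performs (pass the remote command's exit status through, report CRITICAL on a timeout) can also be sketched in Python. This is a different technique from the script above: it does no password handling at all and assumes key-based SSH authentication is already in place. The function name run_check is made up for the example.

```python
import subprocess
import sys

CRITICAL = 2  # Nagios plugin exit status for a critical state


def run_check(argv, timeout=30):
    """Run argv, echo its output, and return a Nagios-style exit status."""
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        print("CRITICAL: execution of %s timed out after %s seconds"
              % (argv[0], timeout))
        return CRITICAL
    except OSError as exc:
        print("CRITICAL: cannot run %s: %s" % (argv[0], exc))
        return CRITICAL
    sys.stdout.write(done.stdout)
    return done.returncode
```

A call such as run_check(["ssh", "-o", "BatchMode=yes", "192.168.1.11", "who"]) would then behave much like the script, minus the password prompt; BatchMode makes ssh fail instead of asking for a password.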
19 February 2009
When using m4 to configure Nagios, great advantages can be realized.
One of the easiest places to gain an advantage by using m4 is when defining
a new host.
Typically, a new host not only has a host definition but a number
of fairly standardized services, such as ping, FTP, telnet, SSH, and
so forth. Thus, when defining a new host configuration, you not
only have to add a new host, but all of the relevant services as well,
and possibly host extra info and service extra info too.
#----------------------------------------
# HOST: marco
#----------------------------------------
define host{
use hpux-host ; Name of host template
host_name marco
address 192.168.4.1
}
define hostextinfo{
host_name marco
action_url http://marco-mp/
}
define service{
use passive-service ; Name of service
host_name marco
service_description System Load
servicegroups Load
}
define service{
use hpux-service ; Name of service
host_name marco
service_description PING
check_command check_ping!100.0,20%!500.0,60%
}
define service{
use hpux-service ; Name of service
host_name marco
service_description TELNET
servicegroups TELNET
check_command check_telnet
}
define serviceextinfo{
host_name marco
service_description TELNET
action_url telnet://marco
}
define service{
use hpux-service ; Name of service
host_name marco
service_description FTP
servicegroups FTP
check_command check_ftp
}
define service{
use hpux-service ; Name of service
host_name marco
service_description NTP
servicegroups NTP
check_command check_ntp
}
define service{
use hpux-service ; Name of service
host_name marco
service_description SSH
servicegroups SSH
check_command check_ssh
}
Compare that output with the m4 code that generated it:
DEFHPUX(`marco',`192.168.4.1')
Another benefit is that if DEFHPUX is coded correctly (with each
service independent, such as an m4 macro DOSSH for SSH), then a single
change to the m4 file, propagated to the Nagios config file, can alter
a service for every HP-UX host (in this example).
Here is a possible definition of DEFHPUX:
define(`DEFHPUX',`
#----------------------------------------
# HOST: $1
#----------------------------------------
define host{
use hpux-host ; Name of host template
host_name $1
address $2
}
define hostextinfo{
host_name $1
action_url http://$1-mp/
}
DOLOAD(`$1')
DOPING(`$1')
DOTELNET(`$1')
DOFTP(`$1')
DONTP(`$1')
DOSSH(`$1')
')
There is a lot more that m4 can do; this is just the tip of the iceberg.
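For readers who do not use m4, the same idea (one call per host expanding into a host definition plus a standard set of services) can be sketched in Python. This is an illustration of the technique, not the author's tooling; the function def_hpux and the templates are hypothetical and cover only the active checks from the example output.

```python
HOST_TEMPLATE = """define host{{
        use                     hpux-host
        host_name               {name}
        address                 {address}
}}
"""

SERVICE_TEMPLATE = """define service{{
        use                     hpux-service
        host_name               {name}
        service_description     {description}
        check_command           {command}
}}
"""

# Standard services for every HP-UX host, taken from the output above.
HPUX_SERVICES = [
    ("PING", "check_ping!100.0,20%!500.0,60%"),
    ("TELNET", "check_telnet"),
    ("FTP", "check_ftp"),
    ("NTP", "check_ntp"),
    ("SSH", "check_ssh"),
]


def def_hpux(name, address):
    """Return the Nagios object definitions for one HP-UX host."""
    parts = [HOST_TEMPLATE.format(name=name, address=address)]
    for description, command in HPUX_SERVICES:
        parts.append(SERVICE_TEMPLATE.format(
            name=name, description=description, command=command))
    return "\n".join(parts)


config = def_hpux("marco", "192.168.4.1")
```

As with the m4 version, changing one entry in HPUX_SERVICES changes that service for every host generated this way.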
Nagios Howto: Using NRPE To Monitor Remote Services
This whitepaper is a continuation of the previous article,
Nagios Howto: Notification Escalations, EventHandlers & Remote Service
Monitoring With NRPE.
As previously mentioned, our focus assumes the use of Linux and a
working Nagios installation. I highly suggest you go back to read the
previous
Nagios howto as it contains important information that we will be building
upon as we move into the second part of this whitepaper.
Thank you for rejoining if you have already read the first Crucial
Nagios whitepaper.
As you have likely seen, the Nagios docs leave a bit to be desired
when it comes to information on the NRPE plugin. In its simplest form,
the NRPE plugin allows you to monitor any number of remote network devices
and services using a single Nagios installation. However, when we combine
EventHandlers with NRPE we then have the ability to repair our remote
servers—self-healing servers. For now, we will focus our attention to
NRPE and walk through the steps to properly configure your NRPE daemon.
Download NRPE Plugin
The NRPE source code and default plugin is available from the
Nagios website. You will need to download the NRPE plugin and any
other plugins to the remote machine that you intend to monitor:
cd /usr/src
wget http://umn.dl.sourceforge.net/sourceforge/nagiosplug/nagios-plugins-1.4.6.tar.gz
tar zxvf nagios-plugins-1.4.6.tar.gz
cd nagios-plugins-1.4.6
The instructions above will download and extract the Nagios plugins,
as well as change into that directory.
Build The Source Code
We now need to build the source code. This step needs to be done
on each remote system that you plan to monitor. Follow the steps below
to build the default plugin set:
./configure --prefix=/usr/local/nagios
make
make install
We now have /usr/local/nagios/libexec/ which contains the
default plugin set.
At this time we need to download and install the NRPE daemon and
plugin. The steps below detail the commands needed for execution:
cd /usr/src/
wget http://internap.dl.sourceforge.net/sourceforge/nagios/nrpe-2.7.tar.gz
tar zxvf nrpe-2.7.tar.gz
cd nrpe-2.7
./configure
make all
Move Things Around
Now we need to manually move the files into place:
cp src/nrpe /usr/local/nagios/libexec/
cp src/check_nrpe /usr/local/nagios/libexec/
cp sample-config/nrpe.cfg /usr/local/nagios/libexec/
We now have our executables in place and are ready to begin configuring
the NRPE daemon on the remote system.
Configuration
The sample configuration file we copied above is a well documented
file. You should take the time to read this file and familiarize yourself
with the configuration options that we will be setting below. Open the
nrpe.cfg file in the Nagios libexec directory in your
favorite editor.
We are going to leave some default settings and change a few settings
for our needs. Set the following configuration options as follows:
pid_file=/var/run/nrpe.pid
server_port=5666
# Set this if you want to nail NRPE to specific IP address
# server_address=192.168.1.1
nrpe_user=nagios
nrpe_group=nagios
# Set this to the remote Nagios installation IP address
allowed_hosts=127.0.0.1
dont_blame_nrpe=0
# command_prefix=/usr/bin/sudo
# Set this to 1 for logging in syslog
debug=0
command_timeout=60
connection_timeout=300
# allow_weak_random_seed=1
That's it for the configuration of NRPE.
Commands
We now need to look at the commands available to NRPE. If you scroll
to the bottom of the nrpe.cfg file you will see the default commands.
The commands are structured like so:
command[check_disk1]=/usr/local/nagios/libexec/check_disk -w 20 -c 10 -p /dev/hda1
Command names are completely arbitrary and can be created on the
fly, e.g.:
command[check_disk2]=/usr/local/nagios/libexec/check_disk -w 20 -c 10 -p /dev/hdb1
The format is very simple: check_disk1 is the command name, and it runs
/usr/local/nagios/libexec/check_disk with the arguments
-w 20 -c 10 -p /dev/hda1. I used this particular command because
it contains the disk check, the one command that you may
need to alter immediately for effective use. At the end of the command
we see the path of the disk device to check, /dev/hda1. You
may not have this drive configuration, so you will need to replace that
with the path to your local disk setup. An easy way to figure this out
is to issue the command df -h and use the returned entry for
home as this is the primary usage space for most.
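The command[name]=command line format is easy to process mechanically, for instance to list what a remote host offers. The following Python sketch parses such lines; the function name parse_nrpe_commands is made up, and the parser deliberately ignores every other nrpe.cfg directive.

```python
import re
import shlex

# Matches lines of the form: command[<name>]=<command line>
COMMAND_RE = re.compile(r"^command\[([^\]]+)\]=(.+)$")


def parse_nrpe_commands(text):
    """Map each command name in nrpe.cfg text to its argv list."""
    commands = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        match = COMMAND_RE.match(line)
        if match:
            commands[match.group(1)] = shlex.split(match.group(2))
    return commands


SAMPLE = """\
# sample fragment of nrpe.cfg
server_port=5666
command[check_disk1]=/usr/local/nagios/libexec/check_disk -w 20 -c 10 -p /dev/hda1
command[check_disk2]=/usr/local/nagios/libexec/check_disk -w 20 -c 10 -p /dev/hdb1
"""

commands = parse_nrpe_commands(SAMPLE)
```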
System Setup
At this point, we have completed configuring NRPE and we need to
set up the system to accommodate Nagios.
First we need to set up permissions for the Nagios user.
adduser nagios
chown -R nagios:nagios /usr/local/nagios/
We’ve set up our Nagios user and changed the ownership of all the
files under the nagios/ dir.
Now we need to edit the file /etc/services and add the following
line:
nrpe 5666/tcp # NRPE
Now, we need to tell our inetd or xinetd about NRPE. Create a file
in /etc/xinetd.d/ called nrpe, and add the following to
that file:
# default: on
# description: NRPE
service nrpe {
flags = REUSE
socket_type = stream
wait = no
user = nagios
server = /usr/local/nagios/libexec/nrpe
server_args = -c /usr/local/nagios/libexec/nrpe.cfg --inetd
log_on_failure += USERID
disable = no
# Change this to your primary Nagios server
only_from = 127.0.0.1
}
This describes to the "super server" the various options necessary
to launch the NRPE daemon when our remote Nagios monitoring system connects.
Now, open the /etc/hosts.allow file and add an entry for the
IP address of your remote monitoring server. If you have a firewall,
you will also want to configure it so that you allow remote connections
from the IP address of your remote monitoring system to port 5666.
Restart your xinetd daemon to reload the configuration changes:
/etc/init.d/xinetd reload
Let’s test it out real quick to make sure nothing has gone wrong
so far. From your remote monitoring server issue the following command:
telnet ip.address.of.remote.nrpe 5666
If the connection immediately closes you’ve got a problem and something
isn’t right. If the socket opens and you are met with the following:
Escape character is '^]'.
Then you’re ready to move on. If you’ve got problems at this point,
go back through each of the steps above and check for any errors in
configuration. If you set debug=1 in your nrpe.cfg,
you can also view your syslog file for failure information.
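The telnet test can be approximated in Python when you want to script it. The sketch below distinguishes the three outcomes described above: no connection at all, a connection that the other side closes straight away, and a socket that stays open. The function name and the returned labels are invented for this example, and a short wait is assumed to be enough to notice an immediate close.

```python
import socket


def probe_nrpe_port(host, port=5666, wait=2.0):
    """Return 'unreachable', 'closed' or 'open' for the given TCP port."""
    try:
        sock = socket.create_connection((host, port), timeout=wait)
    except OSError:
        return "unreachable"          # refused, timed out, no route, ...
    with sock:
        sock.settimeout(wait)
        try:
            data = sock.recv(1)
        except socket.timeout:
            return "open"             # peer keeps the connection and waits
        except OSError:
            return "closed"           # connection reset by the peer
        # An empty read means the peer closed the connection at once.
        return "closed" if data == b"" else "open"
```

A result of "open" corresponds to telnet sitting at the escape character message; "closed" corresponds to the connection dropping immediately.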
Add New Host
We are now ready to add our new host to our primary Nagios installation.
This is very straight forward and should only take a moment.
Back on the primary Nagios installation server we need to edit our
hosts.cfg configuration file. The file is located in /usr/local/nagios/etc/hosts.cfg.
This may change depending on your installation and organization of configuration
files. Read the first part of
this whitepaper for organization advice.
In the hosts.cfg file, add your new host object:
define host{
use generic-host
# Hostname of remote system
host_name host.domain.com
# A friendly name for this server
alias Friendly name
# Remote host IP address
address 127.0.0.1
check_command check-host-alive
max_check_attempts 10
notification_interval 30
notification_period 24x7
notification_options d,r
# Your defined contact group name
contact_groups admins
}
At this time our hosts.cfg file contains two host objects,
the localhost which is running the Nagios application and our remote
host which we will be monitoring.
We now want to add the service objects to our services.cfg
file located in the same directory. Add the following single service
to your services.cfg file:
define service{
use generic-service
# Hostname of remote system
host_name host.domain.com
service_description Primary Disk Usage
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
# Change to your contact group
contact_groups admins
notification_options w,u,c,r
notification_interval 10
notification_period 24x7
check_command check_nrpe!check_disk1
}
You can view the Nagios documentation for the full details on each
of these object configuration options. You will likely want to adapt
the values shown above to your monitoring environment. However,
we will take a look at that last line, the check_nrpe option.
check_nrpe
When monitoring remote services, we first issue a check_nrpe
command followed by a ! and the command on the remote machine
to run. This means that we are going to need an instance of check_nrpe
on our Nagios Server. Simply follow the directions above to download,
build, and install the NRPE check_nrpe script and the nrpe daemon.
Once you have installed these on the Nagios primary server, then we
can proceed.
Now that nrpe is installed on the primary Nagios server, and our
new host and its service are configured, we can reload the nagios service:
/etc/init.d/nagios reload
Web Interface
With the configuration read, you should now be able to access the
web interface of Nagios. Under the Service Detail link you should
see both the new remote host and the service we have set up to
monitor. It is likely in an Unknown State at this time as the
service has not been checked yet.
According to our service definition above, this service will be checked
once every five minutes. If all has gone well, we should see the green
in less than 5 minutes, which confirms proper installation and configuration
of NRPE. On failure the service will go into a Soft State for
two additional minutes. Once a Hard Failure state is achieved,
you will see red and you should be able to check your Nagios log file
in nagios/var/nagios.log for further information.
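The two additional minutes follow from the service definition: max_check_attempts is 3 and retry_check_interval is 1, so after the first failure there are two retries one minute apart before the state becomes hard. The Python sketch below models that counting in a simplified way; it is not Nagios's actual state machine, and both function names are hypothetical.

```python
def classify(results, max_check_attempts=3):
    """Label each check result as 'OK', 'SOFT' or 'HARD'.

    `results` is a sequence of booleans, True meaning the check passed.
    """
    attempt = 0
    labels = []
    for ok in results:
        if ok:
            attempt = 0
            labels.append("OK")
            continue
        attempt = min(attempt + 1, max_check_attempts)
        labels.append("HARD" if attempt >= max_check_attempts else "SOFT")
    return labels


def minutes_until_hard(max_check_attempts=3, retry_check_interval=1):
    """Minutes between the first failed check and the hard state."""
    return (max_check_attempts - 1) * retry_check_interval


print(classify([True, False, False, False, True]))
print(minutes_until_hard())
```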
There are a lot of moving parts with this project so it is best to
focus on a single server and a single service. Once you have a service
properly configured it is a short step to configure the next service.
Simply copy the service object created above and change the check_command
value, check_nrpe!command_issued.
What You Can Do
Many things can be monitored with NRPE that can not be monitored
remotely by Nagios without NRPE. These include:
- Disk space
- Zombie processes
- Number of shell users
- Total processes
- Load average
And anything else that doesn’t run as a public service on the server.
Obviously, the advantage to remotely monitoring these server objects
in a central location is that a problem may be much more quickly identified.
This combined with the previous whitepaper’s escalation procedures provide
an effective response tool for reactively monitoring remote servers.
Remember, any of the commands that you have in the nagios/libexec
folder are available to NRPE. To run these commands on the remote server,
you simply need to set up the command in the nrpe.cfg file on
the remote server. Here is an example using check_load:
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
The -w and the -c are the Warn and Critical
thresholds.
Follow the steps above to add the new service, check_load,
to your services.cfg file. Reload Nagios and that’s that.
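Each threshold is a triplet for the 1-, 5- and 15-minute load averages. How such triplets turn into a plugin status can be sketched as follows in Python. This is a simplified model rather than check_load's source, and it assumes a strict greater-than comparison; the function names are made up.

```python
OK, WARNING, CRITICAL = 0, 1, 2   # Nagios plugin exit statuses


def parse_triplet(text):
    """Turn '15,10,5' into [15.0, 10.0, 5.0]."""
    values = [float(part) for part in text.split(",")]
    if len(values) != 3:
        raise ValueError("expected three comma-separated values")
    return values


def load_state(loads, warn, crit):
    """Compare 1/5/15-minute load averages against both triplets."""
    state = OK
    for load, w, c in zip(loads, parse_triplet(warn), parse_triplet(crit)):
        if load > c:
            return CRITICAL           # any critical breach wins outright
        if load > w:
            state = WARNING
    return state
```

With the thresholds from the example, load_state([16.0, 2.0, 1.0], "15,10,5", "30,25,20") yields WARNING because only the 1-minute value crosses its warning limit.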
EventHandler
In the next whitepaper, we will change our focus to the Nagios
EventHandler. I will demonstrate to you how to repair problems that
Nagios encounters before even contacting a single human. At
Crucial Web Hosting, we
make extensive use of the EventHandler object in Nagios and we credit
it for a very happy support team. Using the EventHandler objects
we can diagnose and repair common problems that occur on local and remote
servers in a matter of minutes and seconds as opposed to hours and days.
We will be performing root tasks using the sudo method and we will
create a simple custom EventHandler on a remote server thus demonstrating
how you can roll your own Nagios plugins.
In the next whitepaper you will learn how to make your servers
heal themselves with no human interaction!
If you missed the first whitepaper in the series I am writing, you
can access it
here. I look forward to your questions and comments.
Just wondering if anyone has any opinion about Zenoss vs Nagios.
I have helped people set up one or the other, but I have not used either
of them over time.
Zenoss looks like it has a nicer user interface, but its setup was less
intuitive. (I used the NagiosQL
tool for administering much of Nagios, and that made the setup much
nicer.)
check_mk is a general purpose Nagios-plugin for retrieving data.
It adopts a new approach for collecting data from operating systems
and network components. It obsoletes NRPE, check_by_ssh, NSClient, and
check_snmp and it has many benefits, the most important of which are
significant reduction of CPU usage on the Nagios host and automatic
inventory of items to be checked on hosts. The larger your Nagios installation
is, the more helpful these improvements.
Interesting presentation about splitting Nagios into multiple domains
(I think that every 100 servers require a separate instance if checks are extensive)
and using passive checks to avoid the bottleneck of "agentless probes".
Configuration file generation can be a big help when servers are similar.
Large deployments require configuration management of Nagios config files.
It's interesting how they "reinvented the bicycle" for some concepts like
querying of alerts, etc., which should be in the enterprise monitoring system
from the very beginning :-)
This is the second part of the two-part series by Wojciech
Kocjan in which we have made an effort to cover everything in
notifications and events in Nagios 3.0. The first part covered
Effective Notifications and Escalations.
In this article, we will cover the following sub-topics:
- External Commands
- Event Handlers
- Modifying Notifications
- Adaptive Monitoring
External Commands
Nagios offers a very powerful mechanism for receiving events and
commands from external applications—the external commands pipe. This
is a pipe file created on a file system that Nagios uses to receive
incoming messages. The name of the file is
rw/nagios.cmd and it is located
in the directory passed as the localstatedir
option during compilation. Following the compilation and installation
instructions and the given guidelines, the file name will be
/var/nagios/rw/nagios.cmd.
The communication does not use any authentication or authorization—the
only requirement is to have write access to the pipe file. An external
command file is usually writable by the owner and the group; the usual
group used is nagioscmd. If
you want a user to be able to send commands to the Nagios daemon, simply
add that user to this group.
A small limitation of the command pipe is that there is no way to
get any results back and so it is not possible to send any query commands
to Nagios. Therefore, by just using the command pipe, you have no verification
that the command you have just passed to Nagios has actually been processed,
or will be processed soon. It is, however, possible to read the Nagios
log file and check if it indicates that the command has been parsed
correctly, if necessary.
An external command pipe is used by the web interface to control
how Nagios works. The web interface does not use any other means to
send commands or apply changes to Nagios. This gives a good understanding
of what can be done with the external command pipe interface.
From the Nagios daemon perspective, there is no clear distinction
as to who can perform what operations. Therefore, if you plan to use
the external command pipe to allow users to submit commands remotely,
you need to make sure that the authorization is in place as well so
that it is not possible for unauthorized users to send potentially dangerous
commands to Nagios.
The syntax for formatting commands is easy. Each command must be
placed on a single line and end with a newline character.
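Writing to the pipe from a program takes a little care, because opening a named pipe normally blocks until a reader is present. The Python sketch below opens the pipe in non-blocking mode so that the call fails quickly when Nagios is not running instead of hanging. The function name submit is an assumption, and the default path is the one given above.

```python
import os
import stat


def submit(line, pipe_path="/var/nagios/rw/nagios.cmd"):
    """Write one external command to the Nagios command pipe."""
    text = line.rstrip("\n")
    if "\n" in text:
        raise ValueError("an external command must fit on a single line")
    if not stat.S_ISFIFO(os.stat(pipe_path).st_mode):
        raise ValueError("%s is not a named pipe" % pipe_path)
    # O_NONBLOCK makes open() raise OSError (ENXIO) when no process
    # has the pipe open for reading, i.e. Nagios is not running.
    fd = os.open(pipe_path, os.O_WRONLY | os.O_NONBLOCK)
    try:
        os.write(fd, (text + "\n").encode())
    finally:
        os.close(fd)
```

As noted earlier, nothing comes back through the pipe, so the only way to confirm that a submitted command was processed is to look for it in the Nagios log file.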
Troubleshooting Nagios 3.0
Tuesday, April 7, 2009 | Linux Servers
In this article by Wojciech Kocjan, we will learn about
troubleshooting Nagios 3.0, which includes troubleshooting the web interface,
passive checks, SSH-based checks, and NRPE. The article includes various
possible errors along with their solutions and detailed explanations
for each error listed.
Notifications and Events in Nagios 3.0 - part 2
Tuesday, May 12, 2009 | Linux Servers
Learning Nagios 3.0 Table of Contents
Monday, October 27, 2008 | All
Notifications and Events in Nagios 3.0 - part 1
Monday, May 11, 2009 | Linux Servers
This is a 2-part series by Wojciech Kocjan. We have made
an attempt to cover all about events and notifications in Nagios 3.0
in detail in this series. The following sub-topics will be covered as
a part of this series:
- Effective Notifications
- Escalations
- External Commands
- Event Handlers
- Modifying Notifications
- Adaptive Monitoring
Passive Checks and NSCA (Nagios Service Check Acceptor)
Wednesday, November 19, 2008 | Networking & Telephony
Nagios is a very powerful platform because it is easy to extend. A great
feature that Nagios offers is the ability for third-party software or
other Nagios instances to report information on the status of services
or hosts. This way, Nagios does not need to schedule and run checks
by itself, but other applications can report information as it is available
to them. This means that your applications can send problem reports
directly to Nagios, instead of just logging them. In this way, your
applications can benefit from powerful notification systems as well
as dependency tracking. In this article by Wojciech Kocjan, we
will see how this mechanism can also be used to receive failure notifications
from other services or machines—for example, SNMP traps.
A while back, I wrote an article for Linux Journal's web edition
entitled
“Howto be a good (and lazy) System Administrator.” A couple astute
readers, after reading the article, asked if I was familiar with the
Nagios monitoring system, and I am. I've been using Nagios for a few
years now.
I had intended to write this article as a How-to on getting Nagios
configured and running for the first time. However, it turns out that
the documentation that comes with Nagios is really pretty good. And
even if you do have problems, and I did, the user community is also
quite responsive. So, rather than beating a dead horse, (with sympathy
to horse lovers) I decided to continue the Good and Lazy Administrator
Theme and discuss extending Nagios with custom service checks and custom
notifications.
Nagios uses a plug-in mechanism to implement all of its server and
service checks as well as all of its notifications. This is good news
for hackers, as it allows us to build new functionality that either
no one else has thought of, or has need of. I wrote a couple scripts
for my Nagios system. One does a custom service check to see if I have
voicemail waiting for me at the Help Desk, and the other does a custom
notification by telephone. Before I go on, I should give a little bit
of background.
I maintain several servers, both for myself and for customers. These
servers range from web servers to phone systems running Asterisk. Just
like most System Administrators, I don't have any “optional” servers
or services; the stuff just has to work, and when it isn't, I need to
know. But I'll tell you, I'm not interested in sitting at the desk watching
the Help Desk phone or the monitoring screen. I'm either too lazy, or
too busy. Either way, that's what silicon is for, right?
My phone system at the house runs on Asterisk. You can read more
about my home infrastructure at http://www.linuxjournal.com/article/9111.
My Nagios server runs on the same server, so it just makes sense to
integrate the two services.
I've created a Nagios script that monitors the Help Desk voicemailbox
and raises a critical service alert in Nagios if any voicemail is waiting.
I've also written a script that can call me, perhaps on my cellphone,
in the event of a service outage. With these two scripts in place, I
can get a call on my cellphone any time someone calls my Help Desk and
leaves a message. I can also get a call if any of my monitored services
fail. Theoretically, I can be at a park playing with my boys and know
that my servers are happy... until the cellphone rings.
I understand that I have kind of a unique situation, but the same
concept is applicable in a business production environment, so let's
get down to looking at code.
First, let's talk about the Help Desk monitoring script. Essentially,
this script checks to see if there are any files in the INBOX in the
Help Desk mailbox. Here is the code:
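#!/usr/bin/perl -w
use strict;

# Nagios plugin: report CRITICAL when voicemail is waiting in the Help Desk INBOX.
my $inbox = "/var/spool/asterisk/voicemail/customers/611/INBOX/";
my ($file, $error);
$error = 0;

# A plugin that cannot run its check should report UNKNOWN (exit code 3).
unless (opendir(DIR, $inbox)) {
    print "UNKNOWN: cannot open $inbox: $!\n";
    exit 3;
}
while (defined($file = readdir(DIR))) {
    next if $file eq "." or $file eq "..";
    $error++;
}
closedir(DIR);

# Divide by 4 as in the original, presumably because each message is
# stored as several files.
$error = $error/4;

if (!$error) {
    print "OK\n";
    exit 0;
}
print "CRITICAL: $error\n";
exit 2;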
Of course, you need to make sure that the Nagios user has access
to the Asterisk voicemailbox, but that can be taken care of by setting
the script set-uid. The script, as you can see, is pretty simple. If
there are any other files in the directory, the script assumes that
there is a voicemail and sends a CRITICAL alert to Nagios. Otherwise
everything is OK.
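For readers who prefer shell, the same idea can be sketched as a function that prints a status line and returns the plugin exit code Nagios expects (0 for OK, 2 for CRITICAL, 3 for UNKNOWN). This is only an illustration, not the author's script: the function name is made up, FILES_PER_MESSAGE mirrors the division by 4 above, and unlike the Perl version it rounds the message count up to a whole number.

```shell
#!/bin/sh
# Sketch of the voicemail check as a shell function.
# FILES_PER_MESSAGE=4 is an assumption taken from the division by 4 in
# the Perl script above.
FILES_PER_MESSAGE=4

# check_voicemail DIR
# Prints a Nagios status line and returns the matching plugin exit code.
check_voicemail() {
    dir=$1
    if [ ! -d "$dir" ] || [ ! -r "$dir" ]; then
        echo "UNKNOWN: cannot read $dir"
        return 3
    fi
    # ls -A lists every entry except "." and "..".
    files=$(ls -A "$dir" | wc -l | tr -d ' ')
    if [ "$files" -eq 0 ]; then
        echo "OK"
        return 0
    fi
    # Round up so that a partially written message still counts as one.
    messages=$(( (files + FILES_PER_MESSAGE - 1) / FILES_PER_MESSAGE ))
    echo "CRITICAL: $messages"
    return 2
}

# Example:
#   check_voicemail /var/spool/asterisk/voicemail/customers/611/INBOX
#   exit $?
```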
To enable Nagios to use this check script, we need to define it in
checkcommands.cfg. Here is the definition I used:
define command{
command_name check_help
command_line /etc/nagios/local/check_611.pl
}
Now, I can refer to the check_help check script in the services.cfg
file. Here's how I did it:
define service {
use generic-service
name Help_Desk
host_name my_server
service_description Help Desk Voicemail
check_command check_help
register 1
}
With this configuration in place, Nagios can indicate an alarm any
time there is voicemail in the Help Desk mailbox. But that's only half
of what I promised to write about. The next script allows Nagios to
call me to let me know that I've got a fire to put out. Here is that
script:
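#!/usr/bin/perl

foreach $main::phone ("15055551234") {
    # NOTE: an Asterisk call file also needs a "Channel:" line naming whom to
    # dial (presumably built from $main::phone); it is missing from this listing.
    $main::call = <<EOF;
MaxRetries: 0
RetryTime: 1
WaitTime: 120
Account: Enterprise
Context: apps
Extension: OUTAGE
Priority: 1
EOF

    # Write the call file outside the spool directory first ...
    open(FILE, ">/tmp/outage.call") or die("Cannot write /tmp/outage.call: $!\n");
    print FILE $main::call;
    close FILE;

    # ... then move it into place so Asterisk never reads a partial file.
    system("mv /tmp/outage.call /var/spool/asterisk/outgoing");
}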
As you can see, this script isn't complicated, either. It simply
creates an Asterisk “call file” and puts it in Asterisk's outgoing spool
directory. The script is capable of calling multiple numbers... just
in case. It's important that the call file be created in another
directory and then moved into the spool directory. Otherwise bad things can
happen if Asterisk tries to read the file while the script is still
writing it.
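The write-then-move pattern can be sketched in shell as follows. The function name and the staging directory are assumptions for illustration. One caveat worth stating: the move is only atomic when the staging directory and the spool directory are on the same filesystem, because only then is it a plain rename; across filesystems mv falls back to copying.

```shell
#!/bin/sh
# Sketch: write a file completely, then move it into a spool directory.

# spool_file STAGING_DIR SPOOL_DIR NAME
# The file content is read from standard input. STAGING_DIR is assumed to
# be on the same filesystem as SPOOL_DIR so that mv is a simple rename.
spool_file() {
    staging_dir=$1
    spool_dir=$2
    name=$3
    # mktemp creates the file with mode 0600; adjust permissions if the
    # consumer of the spool runs as a different user.
    tmp=$(mktemp "$staging_dir/$name.XXXXXX") || return 1
    if ! cat >"$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    # The file only appears in the spool directory once it is complete.
    mv "$tmp" "$spool_dir/$name"
}

# Example (the staging path is hypothetical):
#   spool_file /var/spool/asterisk/staging /var/spool/asterisk/outgoing \
#       outage.call </tmp/outage.call
```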
Obviously this script relies on some configuration in the Asterisk
dial plan. Here is the relevant part of the dial plan:
exten => OUTAGE,1,answer
exten => OUTAGE,2,playback(/etc/asterisk/sounds/OUTAGE)
exten => OUTAGE,3,hangup
At this point, you're probably realizing that I'm not doing anything
complicated. All that is needed from Asterisk's point of view is an
audio message in /etc/asterisk/sounds/OUTAGE (.wav or .au) that indicates
that something is on fire. Asterisk will select the most reasonable
file extension and play the file when the call is answered.
So all that is left to do is configure Nagios to use this notification
method. This is configured in the misccommands.cfg file. Here is how
I did it:
# 'notify_by_phone' command definition
define command{
command_name notify_by_phone
command_line /etc/nagios/local/notify_by_phone.pl
}
Now that all of the configuration is done, we restart Nagios and
reload the Asterisk dial plan. To do this, we type “/etc/init.d/nagios
restart” at the command line and “extensions reload” at the Asterisk
console.
So now, anytime I have voicemail at the Help Desk, it's indicated
in the Nagios monitoring screen as a critical alert. Also, anytime any
of my servers or services are unavailable, I can get a phone call on
either my home phone or my cell phone. This means that my customers
don't HAVE to have those phone numbers and I can still provide quality
service to them.
Now I realize that I have a unique situation, but I hope that this
article serves as an example of how to create custom Nagios service
checks and notifications, as well as hinting at some of the integration
options available in Asterisk.
__________________________
Mike Diehl is a freelance Computer Nerd specializing in Linux administration,
programing, and VoIP. Mike lives in Albuquerque, NM. with his wife and
3 sons. He can be reached at mdiehl@diehlnet.com
This is the second article in a
two-part series that looks at a hands-on approach to monitoring
a data center using the open source tools Ganglia and Nagios. In Part
2, learn how to install and configure Nagios, the popular open source
computer system and network monitoring application software that watches
hosts and services, alerting users when things go wrong. The article
also shows you how to unite Nagios with Ganglia (from
Part 1) and add two other features to Nagios for standard clusters,
grids, and clouds to help with monitoring network switches and the resource
manager.
Recap of Part 1
Data centers are growing and administrative staffs are shrinking,
necessitating efficient monitoring tools for compute resources. Part
1 of this series discussed the benefits of using Ganglia and Nagios
together, then showed you how to install and extend Ganglia with homemade
monitoring scripts.
Recall from
Part 1 the multiple definitions of monitoring (depending
on the implier and the inferrer):
- If you're running applications on the cluster, you think: "When
will my job run? When will it be done? And how is it performing
compared to last time?"
- If you're the operator in the network operations center, you
think: "When will we see a red light that means something needs
to be fixed and a service call placed?"
- If you're in the systems engineering group, you think: "How
are our machines performing? Are all the services functioning correctly?
What trends do we see, and how can we better utilize our compute
resources?"
You can find code to monitor exactly what you want to monitor
and that code can be of the open source variety. The most difficult
part of using open source monitoring tools comes when you attempt to
implement an install and puzzle out a configuration that works well
for your environment. Two major problems with open source (and commercial)
monitoring tools are the following:
- No tool will monitor everything you want the way you want it.
- Much customization could be required to get the tool working
in your data center exactly how you want it.
Ganglia is a tool that monitors data centers and is used heavily
in high-performance computing environments (but it's attractive for
other environments too like clouds, render farms, and hosting centers).
It is more concerned with gathering metrics and tracking them over time
compared with Nagios's focus as an alerting mechanism. Ganglia used
to require an agent to run on every host to gather information from
it, but now metrics can be obtained from just about anything through
Ganglia's spoofing mechanism. Ganglia doesn't have a built-in notification
system, but it was designed to support scalable built-in agents on target
hosts.
After reading Part 1, you could install Ganglia, as well as answer
the monitoring questions that different user groups tend to ask. You
could also configure the basic Ganglia setup, use the Python modules
to extend functionality with IPMI (the Intelligent Platform Management
Interface), and use Ganglia host spoofing to monitor IPMI.
Now, let's look at Nagios.
Introducing Nagios
This part shows you how to install Nagios and tie Ganglia back into
it. We're going to add two features to Nagios that'll help your monitoring
efforts in standard clusters, grids, clouds (or whatever your favorite
buzzword is for scale-out computing). The two features are all about:
- Monitoring network switches
- Monitoring the resource manager
In this case, we'll be monitoring TORQUE. When we are finished, you'll
have a framework to control the monitoring system of your entire data
center.
Nagios, like Ganglia, is used heavily in HPC and other environments,
but Nagios is more of an alerting mechanism than Ganglia (which is more
focused on gathering and tracking metrics). Nagios previously only polled
information from its target hosts, but has recently developed plug-ins
that allow it to run agents on those hosts. Nagios has a built-in notification
system.
Now let's install Nagios and set up a baseline monitoring system
of an HPC Linux® cluster to address the three different monitoring perspectives:
- The application person can see how full the queues are and see
available nodes for running jobs.
- The NOC can be alerted of system failures or see a shiny red
error light on the Nagios Web interface. They also get notified
via email if nodes go down or temperatures get too high.
- The system engineer can graph data, report on cluster utilization,
and make decisions on future hardware acquisitions.
check_openmanage is a plugin for Nagios that checks the hardware
health of Dell PowerEdge and PowerVault servers. It uses the Dell OpenManage
Server Administrator (OMSA) software to accomplish this task. check_openmanage
can be used remotely with SNMP or locally with NRPE. The plugin checks
the health of the storage subsystem, power supplies, memory modules,
temperature probes, etc., and gives an alert if any of the components
are faulty or operate outside normal parameters.
Changes: The --global option was added, which turns on checking of
everything. If used with SNMP, the global system health status is also
probed, to protect the user against bugs in the plugin. If used with
omreport, the overall chassis health is used. Support for SNMP version
3 was added. Checking of esmhealth was added, which checks the overall
health of the ESM log, i.e. the fill grade. Alert log reporting was fixed
to use the same format as for the ESM log. Output messages are now sorted
by severity. Minor changes were made in how out-of-date controller
firmware/driver is reported.
Nagiosgraph is an add-on for Nagios. It collects service perfdata in
RRD format, and displays the resulting graphs via CGI.
September 11th, 2008 by
Mike Diehl in
__________________________
Mike Diehl is a recently self-employed Computer Nerd and lives in
Albuquerque, NM. with his wife and 3 sons. He can be reached at mdiehl@diehlnet.com
By Wojciech Kocjan on November
20, 2008 (8:00:00 PM)
System monitoring tool Nagios offers
a powerful mechanism for receiving events and commands from external
applications. External commands are usually sent from event handlers
or from the Nagios Web interface. You will find external commands most
useful when writing event handlers for your system, or when writing
an external application that interacts with Nagios.
This article is excerpted from the newly published book
Learning Nagios 3.0 from
Packt
Publishing.
The external commands pipe is a pipe file created on a filesystem
that Nagios uses to receive incoming messages. The communication does
not use any authentication or authorization -- the only requirement
is to have write access to the pipe file, rw/nagios.cmd, which is located
in the directory passed as the localstatedir option during compilation.
An external command file is usually writable by the owner and the
group; the usual group used is nagioscmd. If you want a user to be able
to send commands to the Nagios daemon, simply add that user to this
group.
A small limitation of the command pipe is that there is no way to
get any results back, so it is not possible to send any query commands
to Nagios. Therefore, by just using the command pipe, you have no verification
that the command you have passed to Nagios has been processed, or will
be processed soon. It is, however, possible to read the Nagios log file
and check whether it indicates that the command has been parsed correctly.
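A minimal sketch of that log check follows. It assumes that logging of external commands is enabled in nagios.cfg and that accepted commands appear in the log as lines containing "EXTERNAL COMMAND:" followed by the command text; the function names and the log path in the example are illustrative, not part of Nagios.

```shell
#!/bin/sh
# Sketch: confirm from the Nagios log that an external command was read.

# command_was_logged LOG_FILE COMMAND_TEXT
# Succeeds if the log holds an "EXTERNAL COMMAND:" entry starting with
# COMMAND_TEXT (matched as a fixed string, not a regular expression).
command_was_logged() {
    grep -F -q "EXTERNAL COMMAND: $2" "$1"
}

# wait_for_command LOG_FILE COMMAND_TEXT [TRIES]
# Polls the log once per second, up to TRIES times (default 5).
wait_for_command() {
    tries=${3:-5}
    while [ "$tries" -gt 0 ]; do
        command_was_logged "$1" "$2" && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Example (the log path is an assumption):
#   wait_for_command /var/nagios/nagios.log "ADD_HOST_COMMENT;somehost" 10
```

A matching log entry only shows that Nagios read the command, not that the requested action had the intended effect.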
The Nagios Web interface uses an external command pipe to control
how Nagios works. The Web interface does not use any other means to
send commands or apply changes to Nagios.
From the Nagios daemon perspective, there is no clear distinction
as to who can perform what operations. Therefore, if you plan to use
the external command pipe to allow users to submit commands remotely,
you need to make sure that authorization is in place so that unauthorized
users cannot send potentially dangerous commands to Nagios.
The syntax for formatting commands is easy. Each command must be
placed on a single line and end with a newline character. The syntax
is as follows:
[TIMESTAMP] COMMAND_NAME;argument1;argument2;...;argumentN
TIMESTAMP is written as Unix time -- that is, the number of seconds
since 1970-01-01 00:00:00. You can create this by using the date command.
Most programming languages also offer the means to get the current Unix
time.
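To make the format concrete, here is a small sketch of a helper that assembles one command line from a timestamp, a command name, and any number of arguments. The function name is hypothetical. Because each command must fit on a single line, the helper refuses arguments that contain a newline.

```shell
#!/bin/sh
# Sketch: build one line in the external command format
#   [TIMESTAMP] COMMAND_NAME;argument1;...;argumentN

# nagios_cmd_line TIMESTAMP COMMAND_NAME [ARG...]
nagios_cmd_line() {
    ts=$1
    line=$2
    shift 2
    # Capture a literal newline character for the check below.
    nl=$(printf '\nx')
    nl=${nl%x}
    for arg in "$@"; do
        case $arg in
            *"$nl"*)
                echo "argument contains a newline, refusing" >&2
                return 1
                ;;
        esac
        # Arguments are joined with semicolons.
        line="$line;$arg"
    done
    printf '[%s] %s\n' "$ts" "$line"
}

# Example:
#   nagios_cmd_line "$(date +%s)" SCHEDULE_HOST_CHECK somehost "$(date +%s)" \
#       >/var/nagios/rw/nagios.cmd
```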
Commands are written in upper case. The arguments depend on the actual
command. For example, to add a comment to a host stating that it has
passed a security audit, you can use the following shell command:
echo "[`date +%s`] ADD_HOST_COMMENT;somehost;1;Security Audit;\
This host has passed security audit on `date +%Y-%m-%d`" >/var/nagios/rw/nagios.cmd
This will send an
ADD_HOST_COMMENT command to Nagios over the external command pipe.
Nagios will then add a comment to the host, somehost, stating that the
comment originated from Security Audit. The first argument specifies
the host name to add the comment to; the second tells Nagios if this
comment should be persistent. The next argument describes the author
of the comment, and the last argument specifies the actual comment text.
Similarly, adding a comment to a service requires the use of the
ADD_SVC_COMMENT command. The command's syntax is similar to that
of the ADD_HOST_COMMENT command except that the command requires the
specification of the host name and service name.
You can also delete a single comment or all comments using the
DEL_HOST_COMMENT,
DEL_ALL_HOST_COMMENTS, and
DEL_SVC_COMMENT or
DEL_ALL_SVC_COMMENTS commands.
Other commands worth mentioning are related to scheduling checks
on demand. Often, it is necessary to request that a check be carried
out as soon as possible; for example, when testing a solution.
You can create a script that schedules a check of a host, all services
on that host, and a service on a different host, as follows:
#!/bin/sh
NOW=`date +%s`
echo "[$NOW] SCHEDULE_HOST_CHECK;somehost;$NOW" \
  >/var/nagios/rw/nagios.cmd
echo "[$NOW] SCHEDULE_HOST_SVC_CHECKS;somehost;$NOW" \
  >/var/nagios/rw/nagios.cmd
echo "[$NOW] SCHEDULE_SVC_CHECK;otherhost;Service Name;$NOW" \
  >/var/nagios/rw/nagios.cmd
exit 0
The commands
SCHEDULE_HOST_CHECK and
SCHEDULE_HOST_SVC_CHECKS accept a host name and the time at which
the check should be scheduled. The
SCHEDULE_SVC_CHECK command requires the specification of a service
description as well as the name of the host to schedule the check on.
Normal scheduled checks, such as the ones scheduled above, might
not actually take place at the time that you scheduled them. Nagios
also takes allowed time periods into account and checks whether
checks have been disabled for a particular object or globally for
the entire Nagios instance.
There are cases when you'll need to force Nagios to do a check --
in such cases, you should use
SCHEDULE_FORCED_HOST_CHECK,
SCHEDULE_FORCED_HOST_SVC_CHECKS, and
SCHEDULE_FORCED_SVC_CHECK commands. They work in exactly the same
way as described above, but make Nagios skip the checking of time periods
and ignore whether checks are disabled for this particular object.
This way, a check will always be performed, regardless of other Nagios
parameters.
Other commands worth using are related to custom variables, introduced
in Nagios 3. When you define a custom variable for a host, service,
or contact, you can change its value on the fly with the external command
pipe.
As these variables can then be directly used by check or notification
commands and event handlers, it is possible to make other applications
or event handlers change these attributes directly without modifications
to the configuration files.
How might this work? Suppose that the IT staff registers its presence
via an application without any GUI. This application periodically sends
information about the latest known IP address, and that information
is then passed to Nagios assuming that the person is in the office.
This would later be sent to a notification command to use that specific
IP address while sending a message to the user.
Assuming that the user name is jdoe and the custom variable name
is DESKTOPIP, the message that would be sent to the Nagios external
command pipe would be as follows:
[1206096000] CHANGE_CUSTOM_CONTACT_VAR;jdoe;DESKTOPIP;12.34.56.78
This would cause a subsequent use of $_CONTACTDESKTOPIP$ to return
a value of 12.34.56.78.
Nagios offers the
CHANGE_CUSTOM_CONTACT_VAR,
CHANGE_CUSTOM_HOST_VAR, and
CHANGE_CUSTOM_SVC_VAR commands for modifying custom variables in
contacts, hosts, and services.
The commands explained above are just a small subset of the full
capabilities of the Nagios external command pipe. For a complete list
of commands, visit
the External Command List.
Posted by
philcore
on Mon 28 Nov 2005 at 12:23
Nagios is a powerful, modular network monitoring system that can
be used to monitor many network services like smtp, http and dns on
remote hosts. It also has support for snmp to allow you to check things
like processor loads on routers and servers. I couldn't begin to cover
all of the things that nagios can do in this article, so I'll just cover
the basics to get you up and running.
apt-get install nagios-text
First we need to define people that will be notified, and define how
they should be notified. In the example below, I define two users, joe
and paul. Joe is the network guru and cares about routers and switches.
Paul is the systems guy, and he cares about servers. Both will be notified
via email and by pager. Note that if you are going to monitor your email
server, you will want to use another notification method besides email.
If your email server is down, you can't send anybody an email to notify
them! :) In that case you will want to use a pager server to send a
text message to a phone or pager, or set up a second nagios monitor
that uses a different mail server to send email.
Edit /etc/nagios/contacts.cfg and add the following users:
define contact{
contact_name joe
alias Joe Blow
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,u,r
service_notification_commands notify-by-email,notify-by-epager
host_notification_commands host-notify-by-email,host-notify-by-epager
email joe@yourdomain.com
pager 5555555@pager.yourdomain.com
}
define contact{
contact_name paul
alias Paul Shiznit
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,u,r
service_notification_commands notify-by-email,notify-by-epager
host_notification_commands host-notify-by-email,host-notify-by-epager
email paul@yourdomain.com
pager 5556666@pager.yourdomain.com
}
Now add the users to groups.
In /etc/nagios/contactgroups.cfg add the following:
define contactgroup{
contactgroup_name router_admin
alias Network Administrators
members joe
}
define contactgroup{
contactgroup_name server_admin
alias Systems Administrators
members paul
}
You can add multiple members to a contact group by listing comma separated
users.
Now to define some hosts to monitor. For my example, I define two
machines, a mail server and a router.
Edit /etc/nagios/hosts.cfg and add:
define host{
use generic-host
host_name gw1.yourdomain.com
alias Gateway Router
address 10.0.0.1
check_command check-host-alive
max_check_attempts 20
notification_interval 240
notification_period 24x7
notification_options d,u,r
}
define host{
use generic-host
host_name mail.yourdomain.com
alias Mail Server
address 10.0.0.100
check_command check-host-alive
max_check_attempts 20
notification_interval 240
notification_period 24x7
notification_options d,u,r
}
Now we add the hosts to groups. I define groups called 'routers' and
'servers' and add the router and mail server respectively.
Edit /etc/nagios/hostgroups.cfg
define hostgroup{
hostgroup_name routers
alias Routers
contact_groups router_admin
members gw1.yourdomain.com
}
define hostgroup{
hostgroup_name servers
alias Servers
contact_groups server_admin
members mail.yourdomain.com
}
Again, for multiple members, just use a comma separated list of hosts.
Next define services to monitor on each of the hosts. Nagios has
many built-in plugins for monitoring. On a debian sarge system, they
are stored in /usr/lib/nagios/plugins. Here we want to monitor the smtp
service on the mail server, and do ping checks on the router.
Edit /etc/nagios/services.cfg
define service{
use generic-service
host_name mail.yourdomain.com
service_description SMTP
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups server_admin
notification_interval 240
notification_period 24x7
notification_options w,u,c,r
check_command check_smtp
}
define service{
use generic-service
host_name gw1.yourdomain.com
service_description PING
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups router_admin
notification_interval 240
notification_period 24x7
notification_options w,u,c,r
check_command check_ping!100.0,20%!500.0,60%
}
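The check_ping line above packs the command name and its arguments into one field, separated by exclamation marks; Nagios makes the arguments available to the command definition as $ARG1$, $ARG2$ and so on. The sketch below only illustrates that splitting rule. The function name is made up, and how each argument is used depends on the command definition on your system (in the commonly shipped check_ping definition the first is the warning threshold and the second the critical threshold).

```shell
#!/bin/sh
# Sketch: show how a check_command value is split into the command name
# and the $ARGn$ macros.

# show_check_args CHECK_COMMAND
show_check_args() {
    rest=$1
    # Everything before the first "!" is the command name.
    printf 'command_name=%s\n' "${rest%%!*}"
    n=1
    while :; do
        case $rest in
            *'!'*) rest=${rest#*!} ;;   # drop the part already consumed
            *) break ;;                 # no separators left
        esac
        printf '$ARG%s$=%s\n' "$n" "${rest%%!*}"
        n=$((n + 1))
    done
}

# Example:
#   show_check_args 'check_ping!100.0,20%!500.0,60%'
```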
And that's it. To test your configurations, you can run
nagios -v /etc/nagios/nagios.cfg
If all is well we can restart nagios and move on to the apache side
to get a visual view of the monitor.
/etc/init.d/nagios restart
Assuming you have a working apache install, you can add the apache.conf
file included in the nagios package to set up the nagios cgi administration
interface. The web interface is not required to run nagios, but it is
definitely worth setting it up. The simplest way to get it up and running
is to copy the supplied conf file over to our apache installation. On
my system, I'm running apache2. Systems running apache 1.3.xx will have
slightly different setups.
cp /etc/nagios/apache.conf /etc/apache2/sites-enabled/nagios
Of course you may want to set it up as a virtual server, but I leave
that as an exercise for the reader. Now you will want to set up an allowed
user to view the cgi interface. By default, nagios grants full administrative
access to the nagiosadmin user. Nagios uses apache htpasswd style authentication,
so here we add the user nagiosadmin with password mypassword to the default
nagios htpasswd file.
htpasswd2 -nb nagiosadmin mypassword >> /etc/nagios/htpasswd.users
You should now be able to restart apache and log on to
http://your.nagios.server/nagios
Nagios is a very powerful tool for monitoring networks. I've only
touched on the basics here, but it should be enough to get you up and
running. Hopefully, once you do, you'll start experimenting with all
the cool features and plugins that are available. The documentation
included in the cgi interface is very detailed and helpful.
The author uses Perl for the plug-in
“Howto be a good (and lazy) System Administrator.” A couple astute
readers, after reading the article, asked if I was familiar with the
Nagios monitoring system, and I am. I've been using Nagios for a few
years now.
I had intended to write this article as a How-to on getting Nagios
configured and running for the first time. However, it turns out that
the documentation that comes with Nagios is really pretty good. And
even if you do have problems, and I did, the user community is also
quite responsive. So, rather than beating a dead horse, (with sympathy
to horse lovers) I decided to continue the Good and Lazy Administrator
Theme and discuss extending Nagios with custom service checks and custom
notifications.
Nagios uses a plug-in mechanism to implement all of its server and
service checks as well as all of its notifications. This is good news
for hackers, as it allows us to build new functionality that either
no one else has thought of, or has need of. I wrote a couple scripts
for my Nagios system. One does a custom
service check to see if I have voicemail waiting for me at the Help
Desk, and the other does a custom notification by telephone. Before
I go on, I should give a little bit of background.
Perl plugin: check_logfiles is a plugin for Nagios which checks logfiles
for defined patterns
check_logfiles is a plugin for Nagios which checks logfiles for defined
patterns. It is capable of detecting logfile rotation. If you tell it
how the rotated archives look, it will also examine these files. Unlike
check_logfiles, traditional logfile plugins were not aware of the gap
which could occur, so under some circumstances they ignored what had
happened between their checks. A configuration file is used to specify
where to search, what to search, and what to do if a matching line is
found.
About: Nagstamon is a Nagios status monitor with a UI that
resides in the GNOME systray or on the Windows desktop. It informs you
in realtime about the status of your Nagios monitored network.
Changes: This release fixes a problem with passwords containing
special characters, and an issue where it omitted showing failed services
on hosts in scheduled downtime.
About: check_oracle_health is a plugin for the Nagios monitoring
software that allows you to monitor various metrics of an Oracle database.
It includes connection time, SGA data buffer hit ratio, SGA library
cache hit ratio, SGA dictionary cache hit ratio, SGA shared pool free,
PGA in memory sort ratio, tablespace usage, tablespace fragmentation,
tablespace I/O balance, invalid objects, and many more.
Release focus: Major feature enhancements
Changes: The tablespace-usage mode now takes into account
when tablespaces use autoextents. The data-buffer/library/dictionary-cache-hitratio
are now more accurate. Sqlplus can now be used instead of DBD::Oracle.
About: check_lm_sensors is a Nagios plugin to monitor the
values of on-board sensors and hard disk temperatures on Linux systems.
Changes: The plugin now uses the standard Nagios::Plugin CPAN
classes, fixing issues with embedded perl.
Perl plugin: check_logfiles is a plugin for Nagios which checks logfiles
for defined patterns
check_logfiles 2.3.3 (Default)
Added: Sun, Mar 12th 2006 15:09 PDT (2
years, 1 month ago)
Updated: Tue, May 6th 2008 10:37 PDT (today)
A short, superficial intro book (190 pages).
Killer phrase from the review below: "This
is the book you should pass to your manager so (s)he understands why and
how an open solution like Nagios is the better choice and can be used for
achieving surpassing solutions."
Warning: Several reviews of
this book look like plants: they were written by reviewers who have only
a single networking book review or just a single review.
Spot on for a well structured book with many WOW-factors,
May 17, 2007
By Nils Valentin (Tokyo, Japan)
--- DISCLAIMER: This is a requested review by PTR, however any opinions
expressed within the review are my personal ones. ---
Introduction - 6p
CHAPTER 1 Best Practices - 12p
CHAPTER 2 Theory of Operations - 26p
CHAPTER 3 Installing Nagios - 11p
CHAPTER 4 Configuring Nagios - 23p
CHAPTER 5 Bootstrapping the Configs - 10p
CHAPTER 6 Watching - 46p
CHAPTER 7 Visualization - 42p
CHAPTER 8 Nagios Event Broker Interface - 19p
APPENDIX A Configure Options - 3p
APPENDIX B nagios.cfg and cgi.cfg - 9p
APPENDIX C Command-Line Options - 10p
Index - 14p
At 190 pages (230 including the appendices and index), the book is
very compact. It teaches you Nagios in a way I have never
heard or read before. I assume that the author's clearly structured
style, which runs through the book like a common thread, is responsible
for the excellent outcome.
The book starts in the introduction with the title "Do it right the
first time" and that hits the spot. What distinguishes
this little portable knowledge base is its exceptionally well thought-out
content and the author's explanations. David is not filling
pages by explaining each and every parameter, but rather showing you
the big picture, and explaining how to approach new issues or why one
technical solution is better than another.
This is the book you should pass to your
manager so (s)he understands why and how an open solution like Nagios
is the better choice and can be used for achieving surpassing solutions.
The book itself basically is divided in two sections:
Background, setup and configuration - Chapters 1-5
Advanced Topics - Chapters 6-8
I found all of the chapters to have a nice balance in the amount
of information needed, but some EXCEPTIONALLY good parts of the book were:
Chapter 1 Best practices
Chapter 2 - the part about scheduling
Chapters 6-8 as a whole
Chapter 6 has thorough explanations of monitoring the different operating
systems (especially the Windows part !!) and other applications.
Chapter 7 for its overall thoroughness on how to visualize your data
to reach a better understanding of the systems and network
you are monitoring.
Chapter 8 describes a filesystem-based status interface. The NEB
module will write a file with its current status code for each service.
I have to admit that some technical details went over my head, but I
thought that was pretty cool !!
The points above are what I found to be exceptionally good and
most likely the strongest selling points of this little portable knowledge
base. That doesn't mean that the parts of the book not mentioned are
weak, mind you.
Funnily enough, the points mentioned above were EXACTLY the points that
I hadn't seen explained this thoroughly anywhere before.
So David's book was exactly spot on for me.
Summary:
To sum it all up in very simple words: This is a hell of a book !!
It's the most compact, well structured book on Nagios that I have seen
to date. It contains many WOW-factors. While reading each chapter you
can virtually "feel" how David's explanations, tips and tricks have already
helped you to avoid time-consuming pitfalls.
So this book is not about "to buy or not to buy"; this is an investment
you don't want to miss !!
I was especially impressed by the thoroughness with which the book is
written from the first page. Although the contents of the first chapter
weren't new to me, the way they were explained already provided many of
those A-ha moments.
The main asset of the book is not the description of the tools themselves,
but rather the thought and consideration the author put into it and
the sharing of those thoughts in a way that lets the reader actually
visualize how and why one solution is better than another, without actually
having the "luxury of experiencing the pitfalls" in a live disaster
scenario.
PS: AFTER I finished reading the book I re-read the "Editorial Review"
Amazon gave above and found that it describes the actual book
and what you should expect pretty well.
>> You can find more reviews on Nagios related books including a comparison
by viewing my profile. <<
With the Nagios Looking Glass (NLG) tool, developer Andy Shellam
has tried to resolve a common problem for network administrators running
Nagios. What happens if you want to provide access to up-to-date information
from Nagios without giving users access to the full Nagios console?
Providing read-only access to the Nagios console can be complicated,
and can occasionally require network re-structuring or can even pose
a security risk.
NLG is designed to fix those issues by taking a feed from Nagios
status data via an HTTP connection and displaying it on a public Web
server. It works in a client-server model with a PHP-based polling server
installed on your Nagios server. A receiver client, also PHP-based,
is installed on your Web server. If you want to use NLG locally, you
can also run the client and the server together on your Nagios server.
The receiver client creates an AJAX-enabled page based on a template.
You can also customize this template to display whatever you require.
You can see a demo of NLG at http://looking-glass.andyshellam.eu/demo/.
02.05.2007
Nagios also comes with a Web-based console and the extensible Nagios Event
Broker (NEB), which allows you to integrate Nagios with other tools,
like database back-ends, and a large collection of monitoring commands
and capabilities. Its current release, version 2.0, is stable and production
ready. You can take a look at Nagios at http://www.nagios.com.
Development of Nagios has not stopped with version 2.0, though. Nagios'
principal developer, Ethan Galstad, has recently released some information
on the status and potential features of the next release, version 3.0.
Galstad's announcement also suggests an alpha release of version 3.0
could be scheduled as early as the end of February 2007.
Features: What's new in Nagios 3.0
So what's new with version 3.0? Well, a lot. Let's walk through the
major new features and look at how some of Nagios' old features have
been expanded or changed.
One of the interesting features introduced
in Nagios 2.0 was adaptive monitoring. Adaptive monitoring allowed a
Nagios configuration to be changed during runtime. For example, you
can change the command being used to check a host, based on changing
conditions in your environment. In the new version, this functionality
is expanded to include the ability to change the times during which
checks are scheduled to occur. This allows you to turn on/off checks
at specific times according to conditions in your environment.
Notifications have also been enhanced, now allowing a delay to be
added to first notifications. Notifications can be generated when flapping
is disabled and, most importantly, notifications can now be sent out
when a scheduled downtime starts, ends or is cancelled.
Objects and templates haven't been forgotten either. One particularly
useful change is the ability to use multiple templates for objects.
Another is the addition of custom variables in host, service and contact
objects. Version 2.0 only allows the application of one template to
an object. Multiple templates offer greater flexibility and power, which
will make a significant difference to the configuration of objects.
Custom variables allow you to define your own directives in object
definitions and, therefore, attach additional information about an object
to its definition. These variables can be retrieved and used elsewhere
in your Nagios environment. For instance, you could define the SNMP
community strings for a host in its definition and then use these later
in a check or external command.
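To make the custom-variable idea concrete, here is a small Python model of the substitution step. It is only a sketch of the behaviour, not Nagios code. It assumes the Nagios 3 convention that a host variable written as _SNMP_COMMUNITY becomes available as the macro $_HOSTSNMP_COMMUNITY$. The function name and the sample command line are illustrative.

```python
import re


def expand_custom_macros(command_line, host_vars):
    """Replace $_HOST<NAME>$ macros with a host's custom variable values.

    `host_vars` maps names as written in the host definition (with the
    leading underscore, e.g. "_SNMP_COMMUNITY") to their values.
    """
    macros = {}
    for name, value in host_vars.items():
        bare = name[1:] if name.startswith("_") else name
        # Custom variable names are treated as upper case in macro names.
        macros["_HOST" + bare.upper()] = value

    def substitute(match):
        # Unknown macros are left untouched rather than blanked out.
        return macros.get(match.group(1), match.group(0))

    return re.sub(r"\$(_HOST[A-Z0-9_]+)\$", substitute, command_line)
```

For example, with host_vars set to {"_snmp_community": "public"}, the command line "check_snmp -C $_HOSTSNMP_COMMUNITY$" would expand to "check_snmp -C public".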
Other object and template changes include: merging service and host-extended
information object data into service and host object definitions, and
adding group member directives to the host and service group objects.
Enhancements to external commands are also present, including the
ability to process commands found in an external file. The suggested
use of this functionality is for passive checks with long output or
complicated scripting. A further change in Nagios 3.0 is that external
command checking is now turned on by default. In previous versions,
such checking was off by default.
Host and service logic alterations have also been made. Most notably,
host checks now run asynchronously in parallel with each other. This
should help balance overall check performance. Another enhancement is
the ability to cache host and service check results and a function to
enable the predictive checking of dependent hosts and services.
The ability to output multiple lines of data from host and service
checks has also been added. Previously, Nagios 2.0 was limited to a
single line of output from checks, thus reducing the utility of some
checks. Now, multiple lines can be received
and processed by Nagios, and the maximum size of plug-in output has been
correspondingly increased to 2 KB.
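As an illustration of multi-line output, the following Python sketch assembles plug-in output in the layout generally described for Nagios 3 plug-ins: a one-line summary, optionally followed by a pipe and performance data, and then additional detail lines. The function name and sample values are hypothetical, and the exact layout should be checked against the plug-in guidelines for the release in use.

```python
def format_plugin_output(summary, details=(), perfdata=None):
    """Build multi-line plug-in output: summary line first, details after."""
    first_line = summary if not perfdata else f"{summary} | {perfdata}"
    # Each detail entry becomes one extra line of long output.
    return "\n".join([first_line, *details])


if __name__ == "__main__":
    print(format_plugin_output(
        "DISK OK - 2 filesystems checked",
        details=["/ 40% used", "/var 55% used"],
        perfdata="root=40%;80;90",
    ))
```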
A number of performance optimizations have been included in Nagios
3.0, as well as enhancements to the Nagios Event Broker and the embedded
Perl interpreter. Also worth mentioning are updates to macros and to
status, comment and retention data.
"The most well-known NEB module is the NDO Utilities module.
The NDO Utilities module is written by Nagios' developer, Ethan Galstad,
and is designed to output events and data from Nagios to standard file or
a Unix socket. "
01.29.2007 | searchenterpriselinux.techtarget.com
The Nagios enterprise monitoring tool generates a variety of events.
The principal events generated are the results of monitoring applications,
databases, devices, services and hosts. Also generated are performance
data and notification events such as outages and downtime. There are
a number of ways to integrate and utilize these events. The most advanced
and effective event integration mechanism is the Nagios Event Broker
(NEB).
NEB uses callback routines that are executed when events occur in
the Nagios server. Using NEB you can write broker modules that can process
these events. NEB allows you to output and integrate events into a variety
of tools including MySQL databases, SNMP traps, syslog messages or use
the event data in a variety of other applications and tools.
Nagios Event Broker functions and triggers
NEB uses shared code libraries called modules that are hooked into
the Nagios server when it is executed. Each module can register callback
procedures that are able to receive and process events. When an event
occurs, NEB checks for the presence of a registered callback and, if
detected, sends the event to the module. The module receives the event
and performs whatever actions are coded into it.
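The register-and-dispatch flow described above can be modelled in a few lines of Python. Real NEB modules are shared objects written in C or C++ against the Nagios headers, so this is only a conceptual sketch. The class, method and event names are invented for illustration and are not part of the NEB API.

```python
from collections import defaultdict


class EventBroker:
    """Toy model of a broker that forwards events to registered callbacks."""

    def __init__(self):
        self._callbacks = defaultdict(list)

    def register(self, event_type, callback):
        # A module announces interest in one event type.
        self._callbacks[event_type].append(callback)

    def dispatch(self, event_type, data):
        # Send the event to every callback registered for its type and
        # report how many callbacks handled it.
        handled = 0
        for callback in self._callbacks.get(event_type, []):
            callback(data)
            handled += 1
        return handled
```

A module that only cares about service checks would register one callback for that event type and never see the others, which keeps the broker itself free of module-specific logic.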
The broker can process a large number of events including, amongst
others:
- Nagios process startup and shutdown
- Host and Service checks
- Plug-in commands and notifications
- External commands and event handlers
- Flapping, comments and downtime
You can see a full list of the callbacks in the nebcallbacks.h
include file located in the include directory of the Nagios source
package.
Enabling Nagios Event Broker
NEB should be enabled by default when you compile Nagios (unless
you disable it). If you want to ensure that NEB gets compiled then specify
the --enable-event-broker configure option when configuring Nagios.
# ./configure --enable-event-broker
Registering modules with Nagios Event Broker
Modules are included into the Nagios configuration by using broker_module
configuration options in the nagios.cfg configuration file. For
example:
broker_module=/usr/local/nagios/bin/testmodule.o
This line would load a module called testmodule.o located
in the /usr/local/nagios/bin directory. You can also specify
a configuration file for a module like so:
broker_module=/usr/local/nagios/bin/testmodule.o config_file=/usr/local/nagios/etc/testmodule.cfg
You need to restart Nagios for any newly defined modules to take
effect.
Writing modules for Nagios Event Broker
NEB Modules can be written in C or C++. You can see an example of
a module in the Nagios package. Located in the module directory off
the root of the package is the helloworld module. You can create it
by compiling the helloworld.c file.
# gcc -shared -o helloworld.o helloworld.c
You can then add this module to Nagios using the broker_module
directive in the nagios.cfg configuration file. Restart Nagios
and the module is now loaded.
The Helloworld module is extremely simple. Helloworld logs a message
to the default Nagios log file when Nagios is started and stopped and
when aggregated status updates start and finish. The message looks like:
[1137151111] helloworld: An aggregated status update just started.
[1137151112] helloworld: An aggregated status update just finished.
You can review the contents of this module, which includes some basic
inline documentation, in the helloworld.c source file.
Available modules for Nagios Event Broker
There are not many NEB modules available so far. The most well-known
NEB module is the NDO Utilities module. The NDO Utilities module is
written by Nagios' developer, Ethan Galstad, and is designed to output
events and data from Nagios to a standard file or a Unix socket. It also
comes with a daemon,
NDO2DB, that can write Nagios data to a MySQL or PostgreSQL database.
It should provide (together with the helloworld module) a good introduction
to NEB and help you get started on writing your own modules.
You can also find the following NEB modules:
- A NEB module that logs to a socket based on client requests
- A NEB module (as yet unreleased) that does event correlation with Nagios
and SEC
- A NEB module that helps integrate Cacti with Nagios
Further help with Nagios Event Broker
There is not a lot of documentation available for NEB thus far. The
only major piece of documentation available is about the
NEB API. You can also review the Nagios source code relevant to
NEB, particularly the include files.
As always the
Nagios development and user mailing lists are good starting places
for assistance.
Dec 05, 2005 (Sys Admin)
In the past few years, Nagios has become the industry standard open
source systems monitoring tool. If you're using an open source app to
monitor the availability, state, or utilization of your servers or network
gear, then chances are you are using Nagios to do it. To those who have
worked with it, this is no surprise. The lightweight design of Nagios
offloads the actual query logic into "plug-ins", which are easily created,
modified, and re-purposed by sys admins. The lack of complex query logic
leaves the Nagios daemon free to manage scheduling and notifications
and to handle UI.
Nagios's "keep it simple" approach makes it straightforward to administer,
network transparent, and amazingly flexible.
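The plug-in model is simple enough to sketch. A plug-in prints one line of status text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The Python function below is a minimal illustration of that convention for a "higher is worse" measurement. The function name, label and thresholds are made up for the example.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_threshold(label, value, warn, crit):
    """Return (exit_code, status_line) for a value where higher is worse."""
    if value is None:
        # No measurement could be taken, so the state is unknown.
        return UNKNOWN, f"{label} UNKNOWN - no value available"
    if value >= crit:
        return CRITICAL, f"{label} CRITICAL - value is {value}"
    if value >= warn:
        return WARNING, f"{label} WARNING - value is {value}"
    return OK, f"{label} OK - value is {value}"
```

A complete plug-in script would print the returned status line and pass the code to sys.exit(), leaving scheduling and notification to the Nagios daemon.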
Two excellent articles by Syed Ali in previous editions of
Sys Admin covered the installation and configuration of Nagios.
In this article, I'll pick up where those articles left off and provide
some creative solutions to problems commonly faced by sys admins working
with Nagios to monitor the health and performance of systems.
It is still unclear why false alerts were generated. Is
this just a plug for Hyperic ?
There was nothing technically wrong with the HP ProLiant servers at
Mynewplace.com,
an online rental services agency based in San Francisco, but the IT
staff kept on getting beeped at 4 a.m. with alerts that eventually proved
to be false alarms.
So while the servers were fine, the IT staff wasn't. Entire days
were being wasted each month diagnosing their clutch of 50 HP ProLiant
DL145s and DL385s running Red Hat Enterprise Linux 4 AS and ES, said
John Shin, Mynewplace.com's director of systems. Shin decided he needed
to make some changes.
Struggling with network monitoring
"We were struggling with monitoring," Shin said, but that may have
been an understatement. Things were so bad, in fact, that at one point
last year he contemplated disabling the monitoring application altogether
because it was doing more harm than good.
The application was Nagios, a popular open source systems and network
monitoring application that provides alerts for user-defined hosts and
services. In Shin's network, however, it was triggering false
alarms because of simple network management protocol [SNMP] incompatibilities
with Mynewplace.com's open source application server, Resin 3.0.
Resin is a Java application server that includes a Java implementation
of the PHP scripting language; it is maintained and supported by San
Diego-based Caucho Technology Inc.
Nagios, JVM and Resin 3.0 woes
Since Resin and Nagios were not directly compatible, Shin would expose
the application stack's Java virtual machines (JVMs) through SNMP and
monitor the environment that way. Unfortunately, response times under
those conditions were sluggish, he said.
"Nagios was not really the problem," Shin said. "It was the
JVM stack not being able to respond to it correctly. It was recording
events in SNMP that were then watched by Nagios and that made things
crawl. There were a lot of man hours wasted, and it would trigger the
4 a.m. pages."
In spite of its popularity on open source repositories like SourceForge.net,
Nagios has its detractors. In a recent interview about Nagios with SearchEnterpriseLinux.com,
Zenoss Inc. CEO Bill Karpovich criticized Nagios for its lack of
enterprise-level support. "The maintainers never thought of it
as a project that an IT manager would use to monitor an entire enterprise
environment," he said. Zenoss is an open source startup vendor in the
systems management space.
... ... ...
The feature-rich, expensive offerings from HP and the other members
of the "big four" – IBM, CA and BMC – have spawned
the "little four" (a phrase coined by analyst
firm RedMonk), comprised of Hyperic, Zenoss, Qlusters and GroundWork.
Executives from those companies have bet their chips
on the valuable midmarket for customer wins like Mynewplace.com.
Compared with OpenView, offerings from the "little four" were priced
approximately two-and-a-half times less on average, Shin found, although
he would not cite specific dollar amounts. OpenView had another strike
against it: "It did not have the framework in place to monitor some
of our key applications," namely Resin and Postgres, Shin said.
Nagios is a free, open source enterprise monitoring tool designed
to run on Linux. It has extensive monitoring and management capabilities
that allow you to check applications, databases and network devices,
as well as Windows and Unix/Linux hosts and services. It is easy to
install, fast to configure and highly customizable.
To see a full list of the changes planned for Nagios 3.0, or if you wish
to try it before its alpha release, you can download a current CVS snapshot
from http://www.nagios.org/development/cvs.php. The Changelog file contained
in the snapshot provides a reasonably full list of the proposed changes.
Notes:
- This is a Spartan WHYFF (We Help You For Free) site written by people
for whom English is not a native language. Some amount of grammar and
spelling errors should be expected.
- The site contains some broken links as it develops like a living tree...
Please try to use Google, Open directory, etc. to find a replacement link
(see HOWTO search the WEB for details). We would appreciate it if you
could mail us a correct link.
Articles
HowToContactNagios
- Munin - Trac
Munin integrates perfectly with Nagios. There are, however, a few things
of which to take notice. This article shows example configurations and
explains the communication between the systems.
Receiving messages in Nagios
First you need a way for Nagios to accept messages from Munin. Nagios
has exactly such a thing, namely NSCA, which is documented here:
http://nagios.sourceforge.net/docs/1_0/addons.html#nsca.
NSCA consists of a client (a binary usually named send_nsca) and a server
usually run from inetd. We recommend that you enable encryption on NSCA
communication.
You also need to configure Nagios to accept messages via NSCA.
NSCA is, unfortunately, not very well documented in Nagios' official
documentation. We'll cover writing the needed service check configuration
further down in this document.
Configuring Nagios
In the main config file, make sure that the command_file directive is
set and that it works. See
http://nagios.sourceforge.net/docs/2_0/configmain.html#command_file
for details.
Below is a sample extract from nagios.cfg:
command_file=/var/run/nagios/nagios.cmd
The /var/run/nagios directory is owned by the user that Nagios runs as.
nagios.cmd is a named pipe on which Nagios accepts external input.
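For orientation, the kind of line that ends up in the command file for a passive service result can be sketched as follows. The snippet assumes the standard PROCESS_SERVICE_CHECK_RESULT external command format. The function name is hypothetical, and it only builds the string instead of writing to the pipe.

```python
import time


def passive_result_command(host_name, service_description, return_code,
                           plugin_output, timestamp=None):
    """Format one external command line reporting a passive service result."""
    if timestamp is None:
        timestamp = int(time.time())
    # Fields are separated by semicolons; the line ends with a newline.
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host_name};{service_description};{return_code};{plugin_output}\n")
```

Writing the returned string to the named pipe given by command_file, as a user with write permission on it, would hand the result to Nagios; NSCA does the equivalent on behalf of remote senders.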
Configuring NSCA, server side
NSCA is run through (x)inetd. Using inetd, the line below enables NSCA
listening on port 5667:
5667 stream tcp nowait nagios /usr/sbin/tcpd /usr/sbin/nsca -c /etc/nsca.cfg --inetd
Using xinetd, the lines below enable NSCA listening on port 5667, allowing
connections only from the local host:
# description: NSCA (Nagios Service Check Acceptor)
service nsca
{
    flags           = REUSE
    type            = UNLISTED
    port            = 5667
    socket_type     = stream
    wait            = no
    server          = /usr/sbin/nsca
    server_args     = -c /etc/nagios/nsca.cfg --inetd
    user            = nagios
    group           = nagios
    log_on_failure  += USERID
    only_from       = 127.0.0.1
}
The file /etc/nsca.cfg defines how NSCA behaves. Check in particular the
nsca_user and command_file directives; these should correspond to the
file permissions and the location of the named pipe described in nagios.cfg.
nsca_user=nagios
command_file=/var/run/nagios/nagios.cmd
Configuring NSCA, client side
The NSCA client is a binary that submits to an NSCA server whatever it
receives on standard input. Its behaviour is controlled by the file
/etc/send_nsca.cfg, which mainly controls encryption.
You should now be able to test the communication between the NSCA client
and the NSCA server, and consequently whether Nagios picks up the message.
NSCA requires a defined format for messages. For service checks, it's
like this:
<host_name>[tab]<svc_description>[tab]<return_code>[tab]<plugin_output>[newline]
The following shows how to test NSCA.
$ /usr/sbin/send_nsca -H localhost -c /etc/send_nsca.cfg
foo.example.com test 0 0
1 data packet(s) sent to host successfully.
This caused the following to appear in /var/log/nagios/nagios.log:
[1159868622] Warning: Message queue contained results for service 'test' on host 'foo.example.com'. The service could not be found!
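The tab-delimited format is easy to get wrong by hand, so a small helper can build it. This Python sketch follows the service-check layout quoted above; the function name is hypothetical, and the validation rules are a cautious assumption, not something NSCA documents.

```python
def nsca_service_message(host_name, svc_description, return_code, plugin_output):
    """Build one tab-delimited service check line for send_nsca."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    fields = [host_name, svc_description, str(return_code), plugin_output]
    for field in fields:
        # Tabs and newlines are the delimiters, so they cannot appear in a field.
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields) + "\n"
```

The returned string is what would be fed to the standard input of send_nsca; with the values from the test above it is "foo.example.com", "test", "0" and "0" separated by tabs.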
Messages are sent by munin-limits based on the state of a monitored data
source: OK, Warning and Critical. Munin does not currently support an
Unknown state (this will be fixed in the future; see Ticket 29 for more
information).
Configuring munin.conf
Munin uses the above-mentioned send_nsca binary to send messages to Nagios.
In /etc/munin/munin.conf, enter this:
contacts nagios
contact.nagios.command /usr/bin/send_nsca -H your.nagios-host.here -c /etc/send_nsca.cfg
Be aware that the -H switch to send_nsca appeared sometime after send_nsca
version 2.1. Always check send_nsca --help!
Configuring Munin plugins
Lots of Munin plugins have (hopefully reasonable) values for Warning and
Critical levels. To set or override these, you can change the values in
munin.conf.
Configuring Nagios services
Now Nagios needs to recognize the messages from Munin as messages about
services it monitors. To accomplish this, every message Munin sends to
Nagios requires a matching (passive) service defined or Nagios will ignore
the message (but it will log that something tried).
A passive service is defined through these directives in the proper Nagios
configuration file:
active_checks_enabled 0
passive_checks_enabled 1
A working solution is to create a template for passive
services, like the one below:
define service {
    name                    passive-service
    active_checks_enabled   0
    passive_checks_enabled  1
    parallelize_check       1
    notifications_enabled   1
    event_handler_enabled   1
    register                0
    is_volatile             1
}
Once the template is defined, each Munin plugin should be registered as
a service, as shown below:
define service {
    use                     passive-service
    host_name               foo
    service_description     bar
    check_period            24x7
    max_check_attempts      3
    normal_check_interval   3
    retry_check_interval    1
    contact_groups          linux-admins
    notification_interval   120
    notification_period     24x7
    notification_options    w,u,c,r
    check_command           check_dummy!0
}
Notes
- host_name is either the FQDN of the host_name registered to the Nagios
plugin, or the host alias corresponding to Munin's
NSCA_Setup
nagios-yum-cfengine
- Paperback: 464 pages
- Publisher: No Starch Press; U.S. Ed edition (May 30, 2006)
- Language: English
- ISBN-10: 1593270704
- ISBN-13: 978-1593270704
- Product Dimensions: 9.2 x 7 x 1.1 inches
Best for Nagios admins who want specific details on plug-ins,
September 4, 2006
I recently received review copies of Pro Nagios 2.0 (PN2) by James Turnbull
and Nagios: System and Network Monitoring (NSANM) by Wolfgang Barth.
I read PN2 first, then NSANM. Both are excellent books, but I expect
potential readers want to know which is best for them. The following
is a radical simplification, and I could honestly recommend readers
buy either (or both) books. If you are completely new to Nagios and
want a very well-organized introduction, I recommend PN2. If you are
somewhat familiar with Nagios and want detailed descriptions of a wide
variety of Nagios plug-ins, I recommend NSANM.
NSANM strengths lie in the depth of coverage of certain elements when
compared to PN2. PN2 devotes 7 pages to host checks, while NSANM's Ch
7 offers 21 pages. PN2 supplies 8 pages on service checks, but NSANM's
Ch 6 gives 46 pages. This level of detail can be very useful. For example,
NSANM's explanation of check_squid also shows how to configure Squid
to allow access to its cache manager.
NSANM shares more information on certain background protocols like SNMP.
PN2's SNMP section is about 7 pages, whereas NSANM's Ch 11 is 36 pages.
NSANM demonstrates more aspects of Nagios' Web interface and the CGI
programs generating pages. I thought author Wolfgang Barth made very
effective use of diagrams, like the network topology explanation in
Ch 4, the service checks in Ch 5, and notification in Ch 12.
NSANM includes some material not mentioned in PN2, like using Nagios
with Cygwin. Sometimes the books are very complementary, as shown by
PN2's discussion of NSClient++ and NSANM's overview of NSClient and
NC_Net.
NSANM is lacking coverage of security, redundancy, and failover, however.
PN2 does address these critical issues. Beware that some of the "chapters"
in NSANM are very short -- like Ch 8 (2 pages!) and Ch 19 (barely 6
pages). I think short sections like those should have been integrated
into longer chapters or moved into the appendices.
Overall, NSANM is a very good book. I believe new Nagios readers should
read PN2, and strongly consider NSANM as a complementary reference volume.
- Hardcover: 424 pages
- Publisher: Apress (April 17, 2006)
- Language: English
- ISBN-10: 1590596099
- ISBN-13: 978-1590596098
- Product Dimensions: 9.3 x 7.1 x 1.1 inches
A short, superficial intro book (190 pages).
Killer phrase from the review below: "This is the book you should
pass to your manager so (s)he understands why and how an open solution like
Nagios is the better choice and can be used for achieving surpassing solutions."
Warning: Several reviews of
this book look like plants: each is written by a reviewer who has only a single
networking book review, or just a single review overall.
Spot on for a well structured book with many WOW-factors,
May 17, 2007
By Nils Valentin (Tokyo, Japan)
--- DISCLAIMER: This is a requested review by PTR, however any opinions
expressed within the review are my personal ones. ---
Introduction - 6p
CHAPTER 1 Best Practices - 12p
CHAPTER 2 Theory of Operations - 26p
CHAPTER 3 Installing Nagios - 11p
CHAPTER 4 Configuring Nagios - 23p
CHAPTER 5 Bootstrapping the Configs - 10p
CHAPTER 6 Watching - 46p
CHAPTER 7 Visualization - 42p
CHAPTER 8 Nagios Event Broker Interface - 19p
APPENDIX A Configure Options - 3p
APPENDIX B nagios.cfg and cgi.cfg - 9p
APPENDIX C Command-Line Options - 10p
Index - 14p
At 190 pages (230 including the appendices and index), the book is
very compact. It teaches you Nagios in a way I have never
heard or read before. I must assume that the author's clearly structured
style, which runs through the book like a common thread, is responsible
for the excellent outcome.
The book starts in the introduction with the title "Do it right the
first time", and that hits the spot. What distinguishes
this little portable knowledge base is its exceptionally well thought
through content and the author's explanations. David is not filling
pages by explaining each and every parameter, but rather showing you
the big picture and explaining how to approach new issues or why one
technical solution is better than another.
This is the book you should pass to your
manager so (s)he understands why and how an open solution like Nagios
is the better choice and can be used for achieving surpassing solutions.
The book itself is basically divided into two sections:
Background, setup and configuration - Chapters 1-5
Advanced Topics - Chapters 6-8
I found all of the chapters to have a nice balance in the amount
of information needed, but some EXCEPTIONALLY good parts of the book were:
Chapter 1 Best practices
Chapter 2 - the part about scheduling
Chapters 6-8 as a whole
Chapter 6 has thorough explanations of monitoring the different OSes
(especially the Windows part!!) and other applications.
Chapter 7 for its overall thoroughness of how to visualize your data
to reach the next level of a better understanding of the systems / network
you are monitoring.
Chapter 8 describes a filesystem-based status interface. The NEB
module will write a file with its current status code for each service.
I have to admit that some technical details went over my head, but I
thought that was pretty cool !!
The points featured above are what I found to be exceptionally good and
most likely the strongest selling points of this little portable knowledge base.
That doesn't mean that the parts of the book not mentioned are
weak, mind you.
Funnily enough, the points mentioned above were EXACTLY the points that
I hadn't seen explained this thoroughly anywhere before.
So David's book was exactly spot on for me.
Summary:
To sum it all up in very simple words: This is a hell of a book !!
It's the most compact, well structured book on Nagios that I have seen
to date. It contains many WOW-factors. While reading each chapter you
can virtually "feel" how David's explanations, tips, and tricks have already
helped you to avoid time-consuming pitfalls.
So this book is not about "to buy or not to buy", this is an investment
you don't want to miss !!
I was especially impressed by the thoroughness with which the book is written,
right from the first page. Although the contents of the first chapter weren't new
to me, the way they were explained already provided many of those A-ha
moments.
The main asset of the book is not the description of the tools themselves,
but rather the thought and consideration the author put into it and
the sharing of those thoughts in a way that the reader can actually
visualize how and why one solution is better over another, without actually
having to go to the "luxury to experience the pitfalls" in a live disaster
scenario.
PS: AFTER I finished reading the book I re-read the "Editorial Review"
Amazon gave above and found it pretty well describing the actual book
and what you should expect.
>> You can find more reviews of Nagios-related books, including a comparison,
by viewing my profile. <<
Copyright © 1996-2009 by Dr. Nikolai Bezroukov.
www.softpanorama.org was
created as a service to the UN Sustainable Development Networking Programme (SDNP)
in the author's free time.
This document is an industrial compilation designed and created
exclusively for educational use and is placed under the copyright of the
Open Content License (OPL).
Site uses AdSense so you need to be aware of Google privacy policy. Copyright of original materials belongs to their respective owners. Quotes are made
for educational purposes only in compliance with the fair use doctrine.
Disclaimer:
- The statements, views and opinions presented on
this web page are those of the author and are not endorsed by, nor do they necessarily
reflect, the opinions of the author's present and former employers, SDNP or any other
organization the author may be associated with.
- We do not warrant the correctness of the information provided or its
fitness for any purpose.
- This site is in no way associated with, nor does it endorse, cybersquatters
using the term "softpanorama" with other main or country domains
(e.g. softpanorama.com) with bad faith intent to profit from the goodwill
belonging to someone else.
Last modified:
October 18, 2009