Trap handling with snmptt, Nagios and wrapper scripts
SNMP trap handling for Nagios with snmptt and shell scripts
Martin Fürstenau [email protected]
September 22, 2006

The "Nagios" method
■ Prerequisite: snmptrapd must be installed
■ For every trap, a handler must be defined in /etc/snmptrapd.conf with the following syntax:
    traphandle OID eventhandler/program
■ The handlers pass their result to NSCA
■ A service check must be provided for every handler
■ Conclusion
  ■ Laborious
  ■ Hard to maintain
  ■ Inflexible

Better with snmptt (http://www.snmptt.org)
■ Prerequisites:
  ■ snmptrapd must be installed
  ■ snmptt must be installed
  ■ The configuration files for snmptt are generated from MIBs with snmpttconvertmib
  ■ The event handler submit_check_result should be installed
■ Message flow:
  ■ snmptrapd hands the trap to snmptt
  ■ snmptt processes the trap and calls
    ■ an event handler (submit_check_result)
    ■ or a script that in turn calls an event handler (submit_check_result)
  ■ the event handler submits the result to Nagios as a passive check

submit_check_result?? (1)
    #!/bin/sh
    # SUBMIT_CHECK_RESULT
    # Written by Ethan Galstad ([email protected])
    # Last Modified: 02-18-2002
    #
    # This script will write a command to the Nagios command
    # file to cause Nagios to process a passive service check
    # result. Note: This script is intended to be run on the
    # same host that is running Nagios. If you want to
    # submit passive check results from a remote machine, look
    # at using the nsca addon.
    #
    # Arguments:
    # $1 = host_name (Short name of host that the service is
    #      associated with)
    # $2 = svc_description (Description of the service)
    # $3 = return_code (An integer that determines the state
    #      of the service check, 0=OK, 1=WARNING, 2=CRITICAL,
    #      3=UNKNOWN).
    # $4 = plugin_output (A text string that should be used
    #      as the plugin output for the service check)

submit_check_result?? (2)
    echocmd="/bin/echo"
    CommandFile="/var/spool/nagios/nagios.cmd"
    # get the current date/time in seconds since UNIX epoch
    datetime=`date +%s`
    # create the command line to add to the command file
    cmdline="[$datetime] PROCESS_SERVICE_CHECK_RESULT;$1;$2;$3;$4"
    # append the command to the end of the command file
    `$echocmd $cmdline >> $CommandFile`

Configuration
■ snmptrapd.conf:
    traphandle default /usr/sbin/snmptthandler
■ snmptt.ini as described in the manual
■ The essential part is the last section:
    [TrapFiles]
    # A list of snmptt.conf files (this is NOT the snmptrapd.conf file). The COMPLETE path
    # and filename. Ex: '/etc/snmp/snmptt.conf'
    snmptt_conf_files = <<END
    /etc/snmptt/snmptt.conf.apx
    END

Example scenario: APX
■ 1 W2K server (1 x production, 1 x ice-cold standby), Poing
■ 1 test server, Venlo NL
■ 4 traps
  ■ Start
  ■ Stop
  ■ Heartbeat
  ■ All cancelled jobs and their problems
■ Only traps from the production server are to be accepted
■ Details such as abort code, severity etc. are carried in the message string of the trap

Converting the required MIBs with snmpttconvertmib
■ Example result /etc/snmptt/snmptt.conf.apx (1):
    # MIB: APX-MIB (file:./apmAPX.mib) converted on Fri Nov 4 16:00:13 2005 using snmpttconvertmib v1.0
    #
    EVENT apxControlStart .1.3.6.1.4.1.1024.2.24.11.10.0.1081 "Status Events" Normal
    FORMAT Control started $*
    EXEC /usr/lib/nagios/ops_plugins/apx_start_trap_wrapper $r
    SDESC
    Control started
    Variables:
      1: apxControlState
      2: apxTrapID
    EDESC
    #
■ Example result /etc/snmptt/snmptt.conf.apx (2):
    #
    EVENT apxControlShutdown .1.3.6.1.4.1.1024.2.24.11.10.0.1082 "Status Events" Normal
    FORMAT Control shutdown $*
    EXEC /usr/lib/nagios/ops_plugins/apx_stop_trap_wrapper $r
    SDESC
    Control shutdown
    Variables:
      1: apxControlState
      2: apxTrapID
    EDESC
    #
■ Example result /etc/snmptt/snmptt.conf.apx (3):
    #
    EVENT apxHeartbeat .1.3.6.1.4.1.1024.2.24.11.10.0.1088 "Status Events" Normal
    FORMAT Control heartbeat $*
    EXEC /usr/lib/nagios/ops_plugins/apx_heartbeat_trap_wrapper $r
    SDESC
    Control heartbeat
    Variables:
      1: apxControlState
      2: apxLastAction
      3: apxTrapID
    EDESC
    #
■ Example result /etc/snmptt/snmptt.conf.apx (4):
    #
    EVENT apxJobCancelled .1.3.6.1.4.1.1024.2.24.11.11.0.2004 "Status Events" Normal
    FORMAT Job cancelled $*
    EXEC /usr/lib/nagios/ops_plugins/apx_service_trap_wrapper $r $3 $5 $2 $7 $1
    SDESC
    Job cancelled
    Variables:
      1: apxTrapID
      2: apxJobkey
      3: apxJobname
      4: apxJobOID
      5: apxCondCode
      6: apxJobstate
      7: apxGroup
    EDESC

The log files (1)
■ snmptrapd.log
    2006-09-03 08:32:50 ops-apx01.ops.oce.net [10.53.11.121] (via UDP: [10.53.11.121]:1048) TRAP, SNMP v1, community OCE_PO
    .1.3.6.1.4.1.1024.2.24.11.10 Enterprise Specific Trap (1081) Uptime: 450 days, 4:48:25.32
    .1.3.6.1.4.1.1024.2.24.11.1.1 = STRING: "1081"
    .1.3.6.1.4.1.1024.2.24.11.1.201 = STRING: "1,5,0,141"
    .1.3.6.1.4.1.1024.2.24.11.1.202 = STRING: "RUN"
    .1.3.6.1.4.1.1024.2.24.11.1.203 = STRING: "03.09.2006 08:32:52"
■ snmptt.log
    Sun Sep 3 08:32:50 2006 .1.3.6.1.4.1.1024.2.24.11.10.0.1081 Normal "Status Events" ops-apx01 - Control started 1081 1,5,0,141 RUN 03.09.2006 08:32:52

The log files (2)
■ snmptrapd.log
    2006-09-03 08:32:57 ops-apx01.ops.oce.net [10.53.11.121] (via UDP: [10.53.11.121]:1067) TRAP, SNMP v1, community OCE_PO
    .1.3.6.1.4.1.1024.2.24.11.10 Enterprise Specific Trap (1088) Uptime: 450 days, 4:48:32.93
    .1.3.6.1.4.1.1024.2.24.11.1.1 = STRING: "1088"
    .1.3.6.1.4.1.1024.2.24.11.1.201 = STRING: "1,5,0,141"
    .1.3.6.1.4.1.1024.2.24.11.1.202 = STRING: "RUN"
    .1.3.6.1.4.1.1024.2.24.11.1.203 = STRING: "03.09.2006 08:32:59"
■ snmptt.log
    Sun Sep 3 08:32:58 2006 .1.3.6.1.4.1.1024.2.24.11.10.0.1088 Normal "Status Events" ops-apx01 - Control heartbeat 1088 1,5,0,141 RUN 03.09.2006 08:32:59

The log files (3)
■ snmptrapd.log
    2006-09-06 12:15:33 ops-apx01.ops.oce.net [10.53.11.121] (via UDP: [10.53.11.121]:3344) TRAP, SNMP v1, community OCE_PO
    .1.3.6.1.4.1.1024.2.24.11.11 Enterprise Specific Trap (2004) Uptime: 453 days, 8:31:06.25
    .1.3.6.1.4.1.1024.2.24.11.1.1 = STRING: "2004"
    .1.3.6.1.4.1.1024.2.24.11.1.101 = STRING: "1858194"
    .1.3.6.1.4.1.1024.2.24.11.1.102 = STRING: "RP651700_1"
    .1.3.6.1.4.1.1024.2.24.11.1.103 = STRING: "0x081190973FFC4642B68B026E2402423D"
    .1.3.6.1.4.1.1024.2.24.11.1.104 = STRING: "ABAP"
    .1.3.6.1.4.1.1024.2.24.11.1.106 = STRING: "?"
    .1.3.6.1.4.1.1024.2.24.11.1.107 = STRING: "MNLsev1"
■ snmptt.log
    Wed Sep 6 12:15:33 2006 .1.3.6.1.4.1.1024.2.24.11.11.0.2004 Normal "Status Events" ops-apx01 - Job cancelled 2004 1858194 RP651700_1 0x081190973FFC4642B68B026E2402423D ABAP ? MNLsev1

Define service (see the snmptt documentation)
    define service{
        host_name               server01        ; Name of host
        service_description     TRAP            ; Name of service.
                                                ; What you use here must match the value passed to the submit_check_result script
        is_volatile             1               ; Enables volatile services
        check_command           check-host-alive ; Used to reset the status to OK when "Schedule an immediate check of this service" is selected
        max_check_attempts      1               ; Leave as 1
        normal_check_interval   1               ; Leave as 1
        retry_check_interval    1               ; Leave as 1
        passive_checks_enabled  1               ; Enables passive checks
        check_period            none            ; When this service can be checked. Because it is a passive service, it never needs to be checked automatically
        notification_interval   31536000        ; Notification interval. Set to a very high number to prevent you from getting pages of previously received traps
                                                ; (1 year - restart Nagios at least once a year! - do not set to 0!)
        notification_period     24x7            ; When you can be notified. Can be changed
        notification_options    w,u,c,r         ; Notify on warning, unknown, critical and recovery
        notifications_enabled   1               ; Enable notifications
        contact_groups          cg_core         ; Name of contact group to notify
    }

Service template definition (in the APX example)
    define service{
        name                    APX-SERVER
        register                0
        host_name               ops-apx01
        passive_checks_enabled  1
        notifications_enabled   1
        is_volatile             1
        check_period            none
        check_freshness         1
        max_check_attempts      1
        normal_check_interval   1
        flap_detection_enabled  0
        retry_check_interval    1
        contact_groups          apx-admins,apx-admins_sms
        notification_interval   31536000
        notification_period     24x7
    }
    define service{
        name                    APX-TRAP
        use                     APX-SERVER
        register                0
        # After 5 hours (18000 seconds), traps are reset by the service check
        freshness_threshold     18000
        check_command           apx_service_trap_reset
        notification_options    c,w
    }

Service definition (in the APX example)
    define service{
        use                     APX-TRAP
        service_description     RP818100
    }

Example wrapper: apx_heartbeat_trap_wrapper (1)
    #!/bin/bash
    # Author: M.Fuerstenau, OPS, Poing
    # Purpose:
    # Will be started from snmptt (config file /etc/snmptt/snmptt.conf.apx).
    # Instead of being started directly from the mentioned config file
    # the use of this wrapper was necessary due to the fact that the
    # service name for the Nagios service has to be generated and
    # that strings like "APXsev1" must be mapped to the appropriate
    # Nagios return code
    HOST=$1
    NAGIOS_DIR=/etc/nagios
    # Check whether the sending host exists at all. Otherwise exit.
    # Only the exit status of grep matters, so all output is discarded.
    grep ^$HOST$ $NAGIOS_DIR/hosts/* > /dev/null 2>&1
    RETCO=$?
    if [ $RETCO -ne 0 ]
    then
        exit 1
    fi

Example wrapper: apx_heartbeat_trap_wrapper (2)
    # Determine the IP of the production host
    PROD_HOST=$(grep host_name $NAGIOS_DIR/services/APX-TRAP-template.cfg | awk '{print $2}')
    HOST_PROD_CFG=$(grep ^$PROD_HOST$ $NAGIOS_DIR/hosts/* | grep host_name | awk '{print $1}' | sed 's/:$//')
    PROD_IP=$(grep address $HOST_PROD_CFG | grep -v '#' | awk '{print $2}')
    # Determine the IP of the requesting host (it shows up sometimes with a host name (Poing), sometimes with an IP (Venlo))
    STR_2_GREP=$(grep ^$HOST$ $NAGIOS_DIR/hosts/* | grep -v alias |awk '{print $2}')
    HOST_ASK_CFG=$(grep ^$HOST$ $NAGIOS_DIR/hosts/* | grep $STR_2_GREP | awk '{print $1}' | sed 's/:$//')
    ASK_IP=$(grep address $HOST_ASK_CFG | grep -v '#' | awk '{print $2}')
    if [ "$PROD_IP" = "$ASK_IP" ]
    then
        /usr/lib/nagios/eventhandlers/submit_check_result $HOST APX_Heartbeat 0 "APX is running"
        exit 0
    else
        exit 1
    fi

Example (re)set: apx_heartbeat_alarm_set
    #!/bin/sh
    HOST=$1
    /usr/lib/nagios/eventhandlers/submit_check_result $HOST APX_Heartbeat 2 "Heartbeat signal from APX is missing. Maybe system is not running!"
    exit 2

Example wrapper: apx_service_trap_wrapper (1)
    #!/bin/bash
    # Author: M.Fuerstenau, OPS, Poing
    # 18.11.2005
    #
    # Purpose:
    # Will be started from snmptt (config file /etc/snmptt/snmptt.conf.apx).
    # Instead of being started directly from the mentioned config file
    # the use of this wrapper was necessary due to the fact that the
    # service name for the Nagios service has to be generated and
    # that strings like "APXsev1" must be mapped to the appropriate
    # Nagios return code

    # Check the number of passed arguments
    if [ $# -ne 6 ]
    then
        echo $*
        echo "Wrong number of arguments - exiting"
        exit 1
    fi

Example wrapper: apx_service_trap_wrapper (2)
    # Assign the passed arguments to variables
    HOST=$1
    APX_JOB_NAME=$2
    APX_COND_CODE=$3
    APX_JOB_KEY=$4
    APX_GROUP=$5
    APX_TRAP=$6
    # Set global variables
    NAGIOS_DIR=/etc/nagios
    HAMLET=0
    # Check whether the sending host exists at all. Otherwise exit.
    # Only the exit status of grep matters, so all output is discarded.
    grep ^$HOST$ $NAGIOS_DIR/hosts/* > /dev/null 2>&1
    RETCO=$?

Example wrapper: apx_service_trap_wrapper (3)
    if [ $RETCO -ne 0 ]
    then
        exit 1
    fi
    # Determine the IP of the production host
    PROD_HOST=$(grep host_name $NAGIOS_DIR/services/APX-TRAP-template.cfg | awk '{print $2}')
    HOST_PROD_CFG=$(grep ^$PROD_HOST$ $NAGIOS_DIR/hosts/* | grep host_name | awk '{print $1}' | sed 's/:$//')
    PROD_IP=$(grep address $HOST_PROD_CFG | grep -v '#' | awk '{print $2}')
    # Determine the IP of the requesting host (it shows up sometimes with a host name (Poing), sometimes with an IP (Venlo))
    STR_2_GREP=$(grep ^$HOST$ $NAGIOS_DIR/hosts/* | grep -v alias |awk '{print $2}')
    HOST_ASK_CFG=$(grep ^$HOST$ $NAGIOS_DIR/hosts/* | grep $STR_2_GREP | awk '{print $1}' | sed 's/:$//')
    ASK_IP=$(grep address $HOST_ASK_CFG | grep -v '#' | awk '{print $2}')

Example wrapper: apx_service_trap_wrapper (5)
    # Catastrophe or warning - that is the question here
    case "$APX_GROUP" in
        APXsev1) HAMLET=2 ;;
        APXsev2) HAMLET=1 ;;
        MNLsev1) HAMLET=2 ;;
        MNLsev2) HAMLET=1 ;;
        EUROCEsev1) HAMLET=2 ;;
        EUROCEsev2) HAMLET=1 ;;
        SMSTESTsev1) HAMLET=2 ;;
        SMSTESTsev2) HAMLET=1 ;;
        *) echo "Unknown groupcode - babe, it is impossible"
    esac
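
The case statement in the wrapper lists every group explicitly, and a group that is not listed leaves HAMLET at its initial value 0, which Nagios reads as OK. As a minimal sketch, assuming every group name ends in sev1 or sev2 as in the slide, the same mapping could be written with suffix patterns and a fallback to UNKNOWN. The function name severity_to_state is hypothetical and not part of the original wrapper.

```shell
#!/bin/sh
# Hypothetical helper: map an APX group name to a Nagios return code.
# Assumption: the severity is encoded as a "sev1"/"sev2" suffix of the group name.
severity_to_state() {
    case "$1" in
        *sev1) echo 2 ;;   # CRITICAL
        *sev2) echo 1 ;;   # WARNING
        *)     echo 3 ;;   # UNKNOWN, instead of silently reporting OK
    esac
}

# Usage: take the group from the trap and store the resulting state.
HAMLET=$(severity_to_state "MNLsev1")
echo "MNLsev1 -> $HAMLET"
```

With this variant a new group such as a further "...sev1" name needs no change to the script, at the cost of accepting any name with a matching suffix.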

Example wrapper: apx_service_trap_wrapper (6)
    #
    # Check whether the host sending the trap is the monitored one. If so, continue, otherwise exit.
    if [ "$PROD_IP" = "$ASK_IP" ]
    then
        /usr/lib/nagios/eventhandlers/submit_check_result $HOST $APX_JOB_NAME $HAMLET "Job $APX_JOB_NAME cancelled - Condition Code: $APX_COND_CODE APX JobName $APX_JOB_KEY - APX Group $APX_GROUP - APX Trap ID $APX_TRAP"
        exit 0
    else
        exit 1
    fi

Example reset: apx_service_trap_reset
    #!/bin/sh
    echo "Ok. No error. No warning."
    exit 0

Example: start and stop
■ Excerpt from apx_start_trap_wrapper:
    if [ "$PROD_IP" = "$ASK_IP" ]
    then
        /usr/lib/nagios/eventhandlers/submit_check_result $HOST APX_Start_Stop 0 "APX started"
        exit 0
    else
        exit 1
    fi
■ Excerpt from apx_stop_trap_wrapper:
    if [ "$PROD_IP" = "$ASK_IP" ]
    then
        /usr/lib/nagios/eventhandlers/submit_check_result $HOST APX_Start_Stop 2 "APX shutdown"
        exit 0
    else
        exit 1
    fi
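
To try the wrappers without touching a live Nagios command file, a stand-in for submit_check_result can write the same PROCESS_SERVICE_CHECK_RESULT line to a temporary file. The following is only a sketch: the function name submit_dry_run and the variable CMD_FILE are assumptions made for illustration, and only the line format is taken from the submit_check_result script shown in the slides.

```shell
#!/bin/sh
# Hypothetical stand-in for submit_check_result that writes to a scratch file
# instead of the Nagios command file.
CMD_FILE=$(mktemp)

submit_dry_run() {
    # $1 = host_name, $2 = svc_description, $3 = return_code, $4 = plugin_output
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$1" "$2" "$3" "$4" >> "$CMD_FILE"
}

# Simulate what the heartbeat wrapper would submit, then show the result.
submit_dry_run ops-apx01 APX_Heartbeat 0 "APX is running"
cat "$CMD_FILE"
```

Quoting each argument keeps plugin output that contains spaces in a single field, which is the main thing to check before pointing a wrapper at the real command file.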