[OBS-2464] r8892 Appears to Break Uptime Monitoring on Linux Systems

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Professional Edition
    • Poller
    • None
    • SLES 11/12 and RHEL 7

    Description

      Hi,

      As far as we can see, r8892 breaks uptime monitoring on Linux servers. I've just updated to r8893 and received a "device rebooted" alert for 30+ servers, and the reported uptime for those servers is wrong. It seems to affect newer distros only: old CentOS boxes aren't affected, but new RHEL and SLES ones are.

      SVN log shows:
      r8892 | mike | 2017-10-12 14:11:41 +0100 (Thu, 12 Oct 2017) | 2 lines

      [MINOR] Prioritizing snmpEngineTime over hrSystemUptime and sysUptime. Clean old geolocation parts.

      This seems to be the wrong thing to do for Linux systems, because only hrSystemUptime appears to report the correct system uptime, i.e. the value shown by the "uptime" command.

      For example, we've a server that has been up for 15 days, 22 hours.
      $ uptime
      14:43pm up 15 days 22:39, 1 user, load average: 0.02, 0.05, 0.01

      Observium's device page now reports it as Uptime 3h 50m 35s

      snmpwalk shows this:
      DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1300251) 3:36:42.51
      SNMP-FRAMEWORK-MIB::snmpEngineTime.0 = INTEGER: 13357 seconds
      HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (137807404) 15 days, 22:47:54.04

      Please can this priority be fixed?

      Thanks,

      Steven
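The three counters in the snmpwalk output above are not in the same unit: sysUpTimeInstance and hrSystemUptime are TimeTicks (hundredths of a second), while snmpEngineTime is whole seconds. A minimal Python sketch that normalises them for comparison follows; the helper and variable names are illustrative only and are not taken from Observium.

```python
from datetime import timedelta

def timeticks_to_seconds(ticks: int) -> float:
    """SNMP TimeTicks count hundredths of a second."""
    return ticks / 100.0

# Values copied from the snmpwalk output in the report.
readings = {
    "sysUpTimeInstance": timeticks_to_seconds(1300251),
    "snmpEngineTime": 13357.0,  # already in seconds
    "hrSystemUptime": timeticks_to_seconds(137807404),
}

# Print each counter as a human-readable duration.
for name, seconds in readings.items():
    print(f"{name:18s} {timedelta(seconds=int(seconds))}")
```

Only hrSystemUptime lands near the 15 days 22 hours shown by the uptime command; the other two are a few hours, which fits counters that restart with the SNMP agent rather than with the host.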

        Activity

          landy Mike Stupalov added a comment (edited):

          Uptime should now be accurate (in your case, taken from hrSystemUptime.0).

          Since there were problems with uptime detection some time ago, a few incorrect reboot triggers are to be expected as well.

          But do you still see reboot event logs for these devices (and if so, how often)?


          stevenr Steven Robson added a comment:

          Looking at this some more, it seems that uptime reporting in the device view is now correct for all affected devices, but something's still wrong with the logic behind "has the device rebooted" for the alert check.

          stevenr Steven Robson added a comment:

          This doesn't appear to be fixed on r8910. We get multiple "device rebooted" alerts per day for most Linux servers, and on some of them, extending snmpd isn't an option (Linux-based appliances). Only Linux devices seem to be affected.

          One such server:

          HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (2114597183) 244 days, 17:52:51.83
          SNMPv2-SMI::enterprises.15601.1.1.0 = Timeticks: (2114597864) 244 days, 17:52:58.64
          SNMP-FRAMEWORK-MIB::snmpEngineBoots.0 = INTEGER: 287
          SNMP-FRAMEWORK-MIB::snmpEngineTime.0 = INTEGER: 3869 seconds
          DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (385986) 1:04:19.86

          Uptime is reported correctly in the web interface:

          Uptime 244 days, 17h 59m 15s

          However, for other servers, it's reported incorrectly, e.g.:

          Observium Alert Uptime 1h 39s

          16:34pm  up 90 days 18:56,  1 user,  load average: 0.11, 0.09, 0.09

          HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (784425551) 90 days, 18:57:35.51
          SNMPv2-SMI::enterprises.15601.1.1.0 = Timeticks: (784427721) 90 days, 18:57:57.21
          SNMP-FRAMEWORK-MIB::snmpEngineBoots.0 = INTEGER: 909
          SNMP-FRAMEWORK-MIB::snmpEngineTime.0 = INTEGER: 3960 seconds
          DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (392875) 1:05:28.75

          But Observium thinks the server has recently rebooted in both cases - we got an alert and it's shown on the Observium homepage.

          The alert check is "device_rebooted eq 1" (also tried ne 0).
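The thread does not show how Observium actually decides that a device has rebooted, so the following is only a sketch of one plausible failure mode, written in Python with hypothetical function names and an assumed 300-second poll interval. It compares host uptime (hrSystemUptime) with agent uptime (snmpEngineTime) for the two servers quoted above, and shows how a "counter went backwards" test misfires if the counter source switches from one to the other.

```python
def ticks_to_seconds(ticks: int) -> float:
    """Convert SNMP TimeTicks (hundredths of a second) to seconds."""
    return ticks / 100.0

def went_backwards(previous_s: float, current_s: float) -> bool:
    """Hypothetical reboot test: uptime is lower than at the last poll."""
    return current_s < previous_s

def classify(hr_ticks: int, engine_time_s: int, slack_s: float = 600.0) -> str:
    """Compare host uptime with agent uptime, both in seconds."""
    host_s = ticks_to_seconds(hr_ticks)
    if host_s - engine_time_s > slack_s:
        return "agent restarted, host still up"
    return "counters agree"

# (hrSystemUptime ticks, snmpEngineTime seconds) from the two servers above.
servers = {
    "244-day server": (2114597183, 3869),
    "90-day server": (784425551, 3960),
}
for name, (hr_ticks, engine_s) in servers.items():
    print(name, "->", classify(hr_ticks, engine_s))

# Assumption: the previous poll stored host uptime one poll interval earlier,
# and the current poll reads agent uptime instead.
previous = ticks_to_seconds(784425551) - 300
print(went_backwards(previous, 3960))
```

The snmpEngineBoots values of 287 and 909 are consistent with this reading: the agent has been re-initialised many times while the hosts themselves stayed up.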

          landy Mike Stupalov added a comment:

          Fixed in r8897.

          adama Adam Armstrong added a comment:

          I've added the ability to inject long-term accurate uptime into net-snmp (we already do this via the unix-agent, but not everyone likes using that!).

          If you add this to your snmpd.conf:

          extend uptime /bin/cat /proc/uptime

          Observium will pick it up during polling and use it in preference to everything else (except unix-agent).
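On Linux, /proc/uptime holds two decimal numbers: seconds since boot, then cumulative idle seconds. How Observium reads the extend output is not shown in this thread, so the Python sketch below only illustrates parsing that text; the function name and the sample string are made up for the example.

```python
def parse_proc_uptime(text: str) -> float:
    """Return seconds since boot from the contents of /proc/uptime."""
    fields = text.split()
    if not fields:
        raise ValueError("empty /proc/uptime output")
    # The first field is uptime in seconds; the second is idle time.
    return float(fields[0])

# Illustrative sample, not taken from a real host.
sample = "1378074.04 5400123.88"
uptime_s = parse_proc_uptime(sample)
print(f"{uptime_s / 86400:.2f} days")
```

Because the value arrives as decimal text rather than a 32-bit TimeTicks counter, it is not subject to the counter wrap that affects hrSystemUptime.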

          adama Adam Armstrong added a comment:

          This is actually because it's now using the uptime of the SNMP daemon, which I assume has been restarted or updated since those systems were last booted. It does indeed seem to be mostly the wrong thing to do in this instance.

          Note that on Linux systems the uptime counter from hrSystemUptime will wrap after roughly 497 days (it is a 32-bit count of hundredths of a second) in any event.
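The wrap point follows from the counter width. TimeTicks is an unsigned 32-bit value in hundredths of a second, so the exact figure can be worked out directly; for such a counter it comes to roughly 497 days. A short Python check, with a hypothetical helper showing what a wrapped counter would report:

```python
TICKS_PER_SECOND = 100
COUNTER_MODULUS = 2 ** 32  # unsigned 32-bit TimeTicks

wrap_seconds = COUNTER_MODULUS / TICKS_PER_SECOND
wrap_days = wrap_seconds / 86400
print(f"counter wraps after {wrap_days:.1f} days")

def wrapped_ticks(true_uptime_s: int) -> int:
    """TimeTicks value a 32-bit counter would show for a given true uptime."""
    return (true_uptime_s * TICKS_PER_SECOND) % COUNTER_MODULUS

# A host that has really been up 500 days reports under 3 days of ticks.
ticks = wrapped_ticks(500 * 86400)
print(ticks, f"= {ticks / TICKS_PER_SECOND / 86400:.2f} days")
```

A poller that trusts the raw counter would see the value drop at the wrap, which looks the same as a reboot unless another source of uptime is available.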

          People

            landy Mike Stupalov
            stevenr Steven Robson
            Votes: 0
            Watchers: 4