[OBS-2464] r8892 Appears to Break Uptime Monitoring on Linux Systems - Observium

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: None
Affects Version/s: Professional Edition
Component/s: Poller
Labels: None
Environment: SLES 11/12 and RHEL 7

Description

Hi,

As far as we can see, r8892 breaks uptime monitoring on Linux servers. I've just updated to r8893 and received "device rebooted" alerts for 30+ servers, and the reported uptime for those servers is wrong. It seems to affect newer distros only: old CentOS boxes aren't affected, but new RHEL and SLES ones are.

SVN log shows:
r8892 | mike | 2017-10-12 14:11:41 +0100 (Thu, 12 Oct 2017) | 2 lines

[MINOR] Prioritizing snmpEngineTime over hrSystemUptime and sysUptime. Clean old geolocation parts.

This seems to be the wrong thing to do for Linux systems, because only hrSystemUptime seems to report the correct system uptime, as reported by the "uptime" command.

For example, we've a server that has been up for 15 days, 22 hours.
$ uptime
14:43pm up 15 days 22:39, 1 user, load average: 0.02, 0.05, 0.01

Observium's device page now reports it as Uptime 3h 50m 35s

snmpwalk shows this:
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1300251) 3:36:42.51
SNMP-FRAMEWORK-MIB::snmpEngineTime.0 = INTEGER: 13357 seconds
HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (137807404) 15 days, 22:47:54.04
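
To put the three values on a common scale, here is a minimal Python sketch. The numbers are copied from the snmpwalk output above; the helper names `ticks_to_seconds` and `format_uptime` are illustrative only and are not part of Observium.

```python
# Values copied from the snmpwalk output above.
SYS_UPTIME_TICKS = 1300251          # DISMAN-EVENT-MIB::sysUpTimeInstance (1/100 s)
SNMP_ENGINE_TIME_S = 13357          # SNMP-FRAMEWORK-MIB::snmpEngineTime.0 (seconds)
HR_SYSTEM_UPTIME_TICKS = 137807404  # HOST-RESOURCES-MIB::hrSystemUptime.0 (1/100 s)


def ticks_to_seconds(ticks):
    """Convert SNMP TimeTicks (hundredths of a second) to whole seconds."""
    return ticks // 100


def format_uptime(seconds):
    """Render a duration in seconds as 'Nd HH:MM:SS'."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    print("sysUpTime      ", format_uptime(ticks_to_seconds(SYS_UPTIME_TICKS)))
    print("snmpEngineTime ", format_uptime(SNMP_ENGINE_TIME_S))
    print("hrSystemUptime ", format_uptime(ticks_to_seconds(HR_SYSTEM_UPTIME_TICKS)))
```

Only hrSystemUptime comes out near the 15 days 22 hours shown by the "uptime" command; the other two are a few hours, which is consistent with them counting from when the SNMP daemon started rather than from system boot.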

Please can this priority be fixed?

Thanks,

Steven

Attachments

screenshot-1.png (18 kB, 2017/10/20 02:15 AM)

Activity

Mike Stupalov added a comment - 2017/10/17 09:49 PM - edited

The uptime should now be correct (in your case, taken from hrSystemUptime.0).

Since there were problems with uptime detection for some time, a few incorrect reboot triggers are to be expected.

But do you still see reboot event logs for these devices (and if so, how often)?

Steven Robson added a comment - 2017/10/17 07:24 PM

Looking at this some more, it seems that uptime reporting in the device view is now correct for all affected devices, but something's still wrong with the logic behind "has the device rebooted" for the alert check.

Steven Robson added a comment - 2017/10/17 04:49 PM

This doesn't appear to be fixed on r8910. We get multiple "device rebooted" alerts per day for most Linux servers, and on some of them, extending snmpd isn't an option (Linux-based appliances). Only Linux devices seem to be affected.

One such server:

HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (2114597183) 244 days, 17:52:51.83

SNMPv2-SMI::enterprises.15601.1.1.0 = Timeticks: (2114597864) 244 days, 17:52:58.64

SNMP-FRAMEWORK-MIB::snmpEngineBoots.0 = INTEGER: 287
SNMP-FRAMEWORK-MIB::snmpEngineTime.0 = INTEGER: 3869 seconds

DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (385986) 1:04:19.86

Uptime is reported correctly in the web interface:

Uptime 244 days, 17h 59m 15s

However, for other servers, it's reported incorrectly, e.g.:

Observium Alert Uptime 1h 39s

16:34pm up 90 days 18:56, 1 user, load average: 0.11, 0.09, 0.09

HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (784425551) 90 days, 18:57:35.51

SNMPv2-SMI::enterprises.15601.1.1.0 = Timeticks: (784427721) 90 days, 18:57:57.21

SNMP-FRAMEWORK-MIB::snmpEngineBoots.0 = INTEGER: 909
SNMP-FRAMEWORK-MIB::snmpEngineTime.0 = INTEGER: 3960 seconds

DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (392875) 1:05:28.75

But Observium thinks the server has recently rebooted in both cases - we got an alert and it's shown on the Observium homepage.

The alert check is "device_rebooted eq 1" (also tried ne 0).
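
One way such false alerts could arise is sketched below in Python. This is an assumption for illustration, not Observium's actual logic: the function `looks_rebooted` is hypothetical and simply flags a reboot whenever the newly polled uptime is lower than the previously stored one. The figures are taken from the second server above.

```python
def looks_rebooted(previous_uptime_s, current_uptime_s):
    """Hypothetical check: treat any decrease in uptime as a reboot."""
    return current_uptime_s < previous_uptime_s


# hrSystemUptime.0 from the second server: 784425551 ticks, about 90 days.
stored_uptime_s = 784425551 // 100

# Same source on the next poll, five minutes later: no reboot flagged.
print(looks_rebooted(stored_uptime_s, stored_uptime_s + 300))

# A poll that takes snmpEngineTime.0 (3960 s) instead: flagged as a reboot,
# even though the system itself has not restarted.
print(looks_rebooted(stored_uptime_s, 3960))
```

Under that assumption, any poll that falls back to a daemon-level counter, or any snmpd restart, would look like a device reboot.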

Mike Stupalov added a comment - 2017/10/14 12:43 PM

Fixed in r8897.

Adam Armstrong added a comment - 2017/10/14 05:04 AM

I've added the ability to inject long-term accurate uptime into net-snmp (we already do this via the unix-agent, but not everyone likes using that!)

If you add this to your snmpd.conf:

extend uptime /bin/cat /proc/uptime

Observium will pick it up during polling and use it in preference to everything else (except unix-agent).
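
For context, /proc/uptime holds two space-separated numbers: seconds since boot and cumulative idle seconds. A minimal Python sketch of reading the first field follows. How Observium itself parses the extend output is not shown in this ticket, so `parse_proc_uptime` and the sample string are illustrative assumptions.

```python
def parse_proc_uptime(text):
    """Return system uptime in whole seconds from /proc/uptime content."""
    fields = text.split()
    if not fields:
        raise ValueError("empty /proc/uptime content")
    # First field: seconds since boot. Second field (idle time) is ignored.
    return int(float(fields[0]))


# Sample content in the /proc/uptime format (values made up for illustration).
sample = "1378074.04 2731485.66\n"
print(parse_proc_uptime(sample))
```

Because this value comes from the kernel rather than from a 32-bit SNMP counter or the daemon's start time, it is unaffected by snmpd restarts.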

Adam Armstrong added a comment - 2017/10/14 04:19 AM

This is actually because it's now using the uptime of the snmp daemon, which I assume has been restarted or updated since those systems were last booted. It does indeed seem to be mostly the wrong thing to do in this instance.

Note that on Linux systems the uptime counter from hrSystemUptime will wrap after roughly 497 days in any event, because it is a 32-bit TimeTicks value counted in hundredths of a second.
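
The wrap point follows from the counter width: TimeTicks is a 32-bit count of hundredths of a second. A quick Python check of the arithmetic:

```python
TICKS_PER_SECOND = 100   # TimeTicks resolution: 1/100 s
MAX_TICKS = 2 ** 32      # 32-bit unsigned counter
SECONDS_PER_DAY = 86400

wrap_seconds = MAX_TICKS / TICKS_PER_SECOND
wrap_days = wrap_seconds / SECONDS_PER_DAY

print(f"hrSystemUptime wraps after about {wrap_days:.1f} days")
```

This works out to a little over 497 days, after which the reported value starts again from zero.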

People

Assignee: Mike Stupalov
Reporter: Steven Robson
Votes: 0
Watchers: 4

Dates

Created: 2017/10/13 03:44 PM
Updated: 2017/11/19 02:10 AM
Resolved: 2017/10/19 08:36 AM