[OBS-2464] r8892 Appears to Break Uptime Monitoring on Linux Systems - Observium

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: None
Affects Version/s: Professional Edition
Component/s: Poller
Labels: None
Environment: SLES 11/12 and RHEL 7

Description

Hi,

r8892 breaks uptime monitoring on Linux servers, as far as we can see. I've just updated to 8893 and received a device rebooted alert for 30+ servers, and for those servers, the reported uptime is wrong. Seems to be newer distros only. Old CentOS boxes aren't affected, but new RHEL and SLES ones are.

SVN log shows:
r8892 | mike | 2017-10-12 14:11:41 +0100 (Thu, 12 Oct 2017) | 2 lines

[MINOR] Prioritizing snmpEngineTime over hrSystemUptime and sysUptime. Clean old geolocation parts.

This seems to be the wrong thing to do for Linux systems, because only hrSystemUptime seems to report the correct system uptime, as reported by the "uptime" command.

For example, we have a server that has been up for 15 days, 22 hours:
$ uptime
14:43pm up 15 days 22:39, 1 user, load average: 0.02, 0.05, 0.01

Observium's device page now reports it as Uptime 3h 50m 35s

snmpwalk shows this:
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1300251) 3:36:42.51
SNMP-FRAMEWORK-MIB::snmpEngineTime.0 = INTEGER: 13357 seconds
HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (137807404) 15 days, 22:47:54.04
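
To make the mismatch concrete, here is a minimal Python sketch that converts the three values above into comparable durations. It assumes only the standard units: TimeTicks are hundredths of a second and snmpEngineTime is whole seconds. The constant and function names are made up for illustration and are not taken from Observium.

```python
from datetime import timedelta

# Values copied from the snmpwalk output above.
HR_SYSTEM_UPTIME_TICKS = 137807404  # hrSystemUptime.0, hundredths of a second
SYS_UPTIME_TICKS = 1300251          # sysUpTimeInstance, hundredths of a second
SNMP_ENGINE_TIME_SECONDS = 13357    # snmpEngineTime.0, whole seconds


def ticks_to_timedelta(ticks: int) -> timedelta:
    """Convert SNMP TimeTicks (1/100 s) to a timedelta."""
    return timedelta(milliseconds=ticks * 10)


host_uptime = ticks_to_timedelta(HR_SYSTEM_UPTIME_TICKS)
agent_uptime = ticks_to_timedelta(SYS_UPTIME_TICKS)
engine_uptime = timedelta(seconds=SNMP_ENGINE_TIME_SECONDS)

# hrSystemUptime follows the host; the other two follow the SNMP agent,
# so they reset whenever snmpd is restarted.
print("hrSystemUptime :", host_uptime)
print("sysUpTime      :", agent_uptime)
print("snmpEngineTime :", engine_uptime)
```

Only the hrSystemUptime value lands near the 15 days 22 hours that "uptime" prints. The other two are within a few minutes of each other at a little over three and a half hours.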

Please can this priority be fixed.

Thanks,

Steven

Attachments

screenshot-1.png
18 kB
2017/10/20 02:15 AM

Activity

Steven Robson added a comment - 2017/10/18 11:27 AM - edited

I was wrong. In some circumstances, Observium is using the wrong OID for system uptime, and reports this in the web interface, and then alerts. So there's still something wrong with the way it's detecting uptime for these systems.

Using r8912

Example:

Uptime in the web interface was reported as about 5 minutes (it refreshed before I could catch the exact value).

uptime (via ssh)
11:09am up 2 days 0:26, 1 user, load average: 0.14, 0.09, 0.06

SNMP-FRAMEWORK-MIB::snmpEngineBoots.0 = INTEGER: 835
SNMP-FRAMEWORK-MIB::snmpEngineTime.0 = INTEGER: 670 seconds

DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (64400) 0:10:44.00

HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (17443127) 2 days, 0:27:11.27

The alert email reports Uptime 5m 33s

So I don't think uptime is detected reliably for these systems: it is corrected on a later poll and then breaks again.

The event log for the device shows:

2017-10-18 11:05:35 servername Device rebooted: after 2 days, 16m 33s
2017-10-16 16:15:39 servername Device rebooted: after 5h 27m 7s
2017-10-16 14:55:30 servername Device rebooted: after 4h 6m 53s
2017-10-16 14:25:49 servername Device rebooted: after 3h 20m 4s
2017-10-16 14:20:10 servername Device rebooted: after 3h 32m 8s
2017-10-16 14:10:33 servername Device rebooted: after 3h 22m 12s
2017-10-16 14:00:26 servername Device rebooted: after 3h 12m 3s
2017-10-16 13:50:28 servername Device rebooted: after 3h 1m 46s
2017-10-16 13:20:09 servername Device rebooted: after 2h 32m 7s
2017-10-16 13:00:57 servername Device rebooted: after 2h 12m 12s
2017-10-16 12:20:33 servername Device rebooted: after 1h 32m 16s
2017-10-16 11:10:44 servername Device rebooted: after 22m 20s
2017-10-16 11:01:09 servername Device rebooted: after 10m 40s

If I check the uptime now, it shows correctly as 2 days, 32m 6s
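
A rough Python sketch of how this pattern could produce repeated "Device rebooted" events. It assumes a reboot is flagged whenever the polled uptime is lower than on the previous poll. That rule is my assumption, not Observium's actual code, and the sample numbers are illustrative only.

```python
from typing import List


def detect_reboots(uptimes: List[float], tolerance: float = 0.0) -> List[int]:
    """Return the poll indices at which uptime went backwards.

    uptimes: uptime in seconds as seen on each successive poll.
    tolerance: slack in seconds allowed before a drop counts as a reboot.
    """
    reboots = []
    for i in range(1, len(uptimes)):
        if uptimes[i] < uptimes[i - 1] - tolerance:
            reboots.append(i)
    return reboots


# Hypothetical polls, 300 s apart, where the chosen source flips between
# the host uptime (about 2 days) and the snmpd uptime (a few minutes).
flapping = [174000, 370, 174600, 970]

# The same host when every poll reads the host uptime.
consistent = [174000, 174300, 174600, 174900]

print(detect_reboots(flapping))
print(detect_reboots(consistent))
```

Under that assumption, every switch from the host counter to the daemon counter looks like a reboot, even though neither counter reset.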

Mike Stupalov added a comment - 2017/10/17 09:49 PM - edited

Uptime should now be accurate (in your case, taken from hrSystemUptime.0).

Since there were problems with uptime detection some time ago, a few incorrect reboot triggers are to be expected as well.

But do you still see reboot events in the event log for these devices (and if so, how often)?

Steven Robson added a comment - 2017/10/17 07:24 PM

Looking at this some more, it seems that uptime reporting in the device view is now correct for all affected devices, but something's still wrong with the logic behind "has the device rebooted" for the alert check.

Steven Robson added a comment - 2017/10/17 04:49 PM

This doesn't appear to be fixed on r8910. We get multiple "device rebooted" alerts per day for most Linux servers, and on some of them, extending snmpd isn't an option (Linux-based appliances). Only Linux devices seem to be affected.

One such server:

HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (2114597183) 244 days, 17:52:51.83

SNMPv2-SMI::enterprises.15601.1.1.0 = Timeticks: (2114597864) 244 days, 17:52:58.64

SNMP-FRAMEWORK-MIB::snmpEngineBoots.0 = INTEGER: 287
SNMP-FRAMEWORK-MIB::snmpEngineTime.0 = INTEGER: 3869 seconds

DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (385986) 1:04:19.86

Uptime is reported correctly in the web interface:

Uptime 244 days, 17h 59m 15s

However, for other servers, it's reported incorrectly, e.g.:

Observium Alert Uptime 1h 39s

16:34pm up 90 days 18:56, 1 user, load average: 0.11, 0.09, 0.09

HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (784425551) 90 days, 18:57:35.51

SNMPv2-SMI::enterprises.15601.1.1.0 = Timeticks: (784427721) 90 days, 18:57:57.21

SNMP-FRAMEWORK-MIB::snmpEngineBoots.0 = INTEGER: 909
SNMP-FRAMEWORK-MIB::snmpEngineTime.0 = INTEGER: 3960 seconds

DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (392875) 1:05:28.75

But Observium thinks the server has recently rebooted in both cases - we got an alert and it's shown on the Observium homepage.

The alert check is "device_rebooted eq 1" (also tried ne 0).

Mike Stupalov added a comment - 2017/10/14 12:43 PM

Fixed in r8897.

Adam Armstrong added a comment - 2017/10/14 05:04 AM

I've added the ability to inject long-term accurate uptime into net-snmp (we already do this via the unix-agent, but not everyone likes using that!)

If you add this to your snmpd.conf:

extend uptime /bin/cat /proc/uptime

Observium will pick it up during polling and use it in preference to everything else (except unix-agent).
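
For illustration, a small Python sketch of what consuming that extend output might look like. /proc/uptime holds two numbers, seconds since boot and cumulative idle seconds, so only the first field matters. The source names and the order after "extend_uptime" in PREFERENCE are hypothetical. Only "unix-agent first, then the extend" comes from the comment above.

```python
from typing import Dict, Tuple

# Preference order, highest first. Only the first two positions are stated
# in this thread; the remaining order is an assumption for the example.
PREFERENCE = ("unix_agent", "extend_uptime", "hrSystemUptime", "sysUpTime")


def parse_proc_uptime(text: str) -> float:
    """Return seconds since boot from /proc/uptime content ("<uptime> <idle>")."""
    return float(text.split()[0])


def pick_uptime(candidates: Dict[str, float]) -> Tuple[str, float]:
    """Pick the most preferred uptime source that is present."""
    for name in PREFERENCE:
        if name in candidates:
            return name, candidates[name]
    raise ValueError("no uptime source available")


# Hypothetical extend output and SNMP values, all in seconds.
extend_output = "7842579.21 31250011.04\n"
candidates = {
    "sysUpTime": 3928.75,
    "hrSystemUptime": 7844255.51,
    "extend_uptime": parse_proc_uptime(extend_output),
}

source, seconds = pick_uptime(candidates)
print(source, seconds)
```

Because /proc/uptime is reported by the kernel as a plain decimal number, it is not subject to the 32-bit TimeTicks limit that applies to the SNMP counters.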

Adam Armstrong added a comment - 2017/10/14 04:19 AM

This is actually because it's now using the uptime of the snmp daemon, which I assume has been restarted or updated since those systems were last booted. It does indeed seem to be mostly the wrong thing to do in this instance.

Note that on Linux systems the uptime counter from hrSystemUptime will wrap after roughly 497 days in any event, because it is a 32-bit TimeTicks value counted in hundredths of a second.
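
The wrap point follows from the counter width. A quick Python check, assuming the standard SNMP TimeTicks definition (an unsigned 32-bit count of hundredths of a second), puts it near 497 days. The helper name is made up for the example.

```python
TICKS_PER_SECOND = 100   # TimeTicks resolution: 1/100 s
COUNTER_MODULUS = 2**32  # TimeTicks is an unsigned 32-bit value
SECONDS_PER_DAY = 86400

# Days until the counter rolls over to zero.
wrap_days = COUNTER_MODULUS / TICKS_PER_SECOND / SECONDS_PER_DAY


def wrapped_ticks(true_uptime_seconds: int) -> int:
    """TimeTicks value an agent would report for a given real uptime."""
    return (true_uptime_seconds * TICKS_PER_SECOND) % COUNTER_MODULUS


# A host that has really been up for 500 days reports far less.
reported = wrapped_ticks(500 * SECONDS_PER_DAY)
reported_days = reported / TICKS_PER_SECOND / SECONDS_PER_DAY

print(round(wrap_days, 1), round(reported_days, 1))
```

So a host that stays up past the wrap point reports an uptime of a few days, which is indistinguishable from a real reboot if only this counter is consulted.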

People

Assignee: Mike Stupalov

Reporter: Steven Robson

Votes: 0

Watchers: 4

Dates

Created: 2017/10/13 03:44 PM

Updated: 2017/11/19 02:10 AM

Resolved: 2017/10/19 08:36 AM