Details
Bug
Resolution: Fixed
Minor
None
Professional Edition
None
SLES 11/12 and RHEL 7
Description
Hi,
As far as we can see, r8892 breaks uptime monitoring on Linux servers. I've just updated to r8893 and received a "device rebooted" alert for 30+ servers, and the reported uptime for those servers is wrong. It seems to affect newer distros only: old CentOS boxes aren't affected, but new RHEL and SLES ones are.
SVN log shows:
r8892 | mike | 2017-10-12 14:11:41 +0100 (Thu, 12 Oct 2017) | 2 lines
[MINOR] Prioritizing snmpEngineTime over hrSystemUptime and sysUptime. Clean old geolocation parts.
This seems to be the wrong thing to do for Linux systems, because only hrSystemUptime appears to report the correct system uptime, i.e. the value shown by the "uptime" command.
For example, we have a server that has been up for 15 days, 22 hours.
$ uptime
14:43pm up 15 days 22:39, 1 user, load average: 0.02, 0.05, 0.01
Observium's device page now reports it as "Uptime 3h 50m 35s".
snmpwalk shows this:
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1300251) 3:36:42.51
SNMP-FRAMEWORK-MIB::snmpEngineTime.0 = INTEGER: 13357 seconds
HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (137807404) 15 days, 22:47:54.04
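To make the discrepancy concrete, here is a minimal Python sketch that converts the three values above to seconds and compares them. It is not Observium's code: `format_uptime` and `pick_host_uptime` are hypothetical helpers, and the "take the largest value" rule is only an assumption that fits a Linux host, where the agent-relative counters cannot be older than the host itself.

```python
# SNMP TimeTicks are hundredths of a second; snmpEngineTime is already in seconds.
TICKS_PER_SECOND = 100

# Values copied from the snmpwalk output above, normalised to seconds.
readings = {
    "sysUpTimeInstance": 1300251 / TICKS_PER_SECOND,
    "snmpEngineTime": 13357,
    "hrSystemUptime": 137807404 / TICKS_PER_SECOND,
}


def format_uptime(seconds):
    """Render a number of seconds as days, hours, minutes and seconds."""
    seconds = int(seconds)
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


def pick_host_uptime(values):
    """Hypothetical rule (an assumption, not Observium's logic): use the largest value."""
    name = max(values, key=values.get)
    return name, values[name]


for name, seconds in readings.items():
    print(f"{name:18} {format_uptime(seconds)}")

print("host uptime source:", pick_host_uptime(readings)[0])
```

Only hrSystemUptime comes out at roughly 15 days 22 hours, which matches "uptime"; the other two are a few hours, which looks like the time since the SNMP agent last started.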
Please can this priority be fixed?
Thanks,
Steven
I was wrong. In some circumstances Observium still uses the wrong OID for system uptime, reports that value in the web interface, and then alerts on it. So there's still something wrong with the way it detects uptime on these systems.
This is with r8912.
Example:
Uptime in the web interface was reported as about 5 mins (the page refreshed before I could catch the exact value).
uptime (via ssh)
11:09am up 2 days 0:26, 1 user, load average: 0.14, 0.09, 0.06
SNMP-FRAMEWORK-MIB::snmpEngineBoots.0 = INTEGER: 835
SNMP-FRAMEWORK-MIB::snmpEngineTime.0 = INTEGER: 670 seconds
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (64400) 0:10:44.00
HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (17443127) 2 days, 0:27:11.27
The alert email reports "Uptime 5m 33s".
So I don't think the uptime is reliable for these systems: it is wrong on one poll, corrected on a later poll, and then breaks again.
The event log for the device shows:
2017-10-18 11:05:35 servername Device rebooted: after 2 days, 16m 33s
2017-10-16 16:15:39 servername Device rebooted: after 5h 27m 7s
2017-10-16 14:55:30 servername Device rebooted: after 4h 6m 53s
2017-10-16 14:25:49 servername Device rebooted: after 3h 20m 4s
2017-10-16 14:20:10 servername Device rebooted: after 3h 32m 8s
2017-10-16 14:10:33 servername Device rebooted: after 3h 22m 12s
2017-10-16 14:00:26 servername Device rebooted: after 3h 12m 3s
2017-10-16 13:50:28 servername Device rebooted: after 3h 1m 46s
2017-10-16 13:20:09 servername Device rebooted: after 2h 32m 7s
2017-10-16 13:00:57 servername Device rebooted: after 2h 12m 12s
2017-10-16 12:20:33 servername Device rebooted: after 1h 32m 16s
2017-10-16 11:10:44 servername Device rebooted: after 22m 20s
2017-10-16 11:01:09 servername Device rebooted: after 10m 40s
If I check the uptime now, it shows correctly as 2 days, 32m 6s.
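The pattern in the event log above would be consistent with the poller switching between a host-relative counter and an agent-relative one from poll to poll. A rough Python sketch of that idea follows. It assumes a reboot is flagged whenever the polled uptime is lower than the previous value; that rule, the `detect_reboots` helper, the 5-minute poll interval and the sample sequences are all illustrative assumptions, not taken from Observium's source.

```python
def detect_reboots(samples):
    """Return the indexes of polls where uptime went backwards (assumed reboot rule)."""
    reboots = []
    previous = None
    for index, uptime in enumerate(samples):
        if previous is not None and uptime < previous:
            reboots.append(index)
        previous = uptime
    return reboots


# Values from the snmpwalk output in this comment, in seconds.
HOST_UPTIME = 174431    # hrSystemUptime: 2 days, 0:27:11
AGENT_UPTIME = 670      # snmpEngineTime
POLL_INTERVAL = 300     # assumption: 5-minute polling

# Hypothetical poll sequence: host counter, then agent counter, then host counter again.
mixed_sources = [
    HOST_UPTIME,
    AGENT_UPTIME + POLL_INTERVAL,
    HOST_UPTIME + 2 * POLL_INTERVAL,
]

# The same three polls if the host counter were used every time.
host_only = [HOST_UPTIME + n * POLL_INTERVAL for n in range(3)]

print("mixed sources:", detect_reboots(mixed_sources))
print("host only:    ", detect_reboots(host_only))
```

With mixed sources the second poll looks like a reboot even though the server never went down; with a single consistent source no reboot is flagged.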