Device Sensors randomly reset and trigger alerts

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Professional Edition
    • Alerting, Poller
    • Hardware WS-C3850-48P
      Operating system Cisco IOS-XE 03.06.00.E (UNIVERSALK9)

    Description

      Cisco devices are randomly "re-discovered", for lack of a better description. When this happens, the temperature sensors are reset to values that do not make sense, which triggers alerts. We then have to manually set each temperature sensor to a valid value, and the alerts clear. This works fine for a few weeks, and then a sensor is randomly reset again to a negative value. I have attached a screenshot of what it looks like when they are reset. When this happens, I also see that a new entry is created and the index number changes. Is a duplicate being created?

      Attachments

        1. added.PNG
          141 kB
          Kyle Kot
        2. deleted.PNG
          83 kB
          Kyle Kot
        3. sensor_reset.PNG
          111 kB
          Kyle Kot

        Activity

          [OBS-1816] Device Sensors randomly reset and trigger alerts

          ciro Ciro Martinez added a comment -

          Kyle - One small suggestion, based on what I've done locally: set your SNMP retries to a higher value and your timeout to 90 seconds or more.

          Ever since I did that, I haven't had any sensors disappear.
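
One way to check whether a longer timeout changes what a device returns, independently of Observium, is to walk it by hand with explicit timeout and retry values. The sketch below only builds and runs a net-snmp `snmpwalk` command; it assumes net-snmp is installed and that the device speaks SNMP v2c. The host, community string, and the choice of the UPS-MIB subtree (1.3.6.1.2.1.33) are placeholders for illustration, not values from this ticket.

```python
import shlex
import subprocess


def build_snmpwalk_cmd(host, community, oid, timeout_s=90, retries=5):
    """Return an argv list for net-snmp's snmpwalk with explicit timeout/retries."""
    return [
        "snmpwalk",
        "-v2c",                  # assumption: device is polled with SNMP v2c
        "-c", community,
        "-t", str(timeout_s),    # per-request timeout in seconds
        "-r", str(retries),      # number of retries after the first attempt
        host,
        oid,
    ]


def walk(host, community, oid, **kwargs):
    """Run the walk and return the non-empty output lines (possibly none)."""
    cmd = build_snmpwalk_cmd(host, community, oid, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return [line for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    # Placeholder host and community; only prints the command, does not run it.
    print(shlex.join(build_snmpwalk_cmd("192.0.2.10", "public", "1.3.6.1.2.1.33")))
```

Comparing the number of lines returned by `walk(...)` with a short timeout against the same walk with a long one gives a rough indication of whether the device is slow to answer rather than silent.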
          awesomazing Kyle Kot added a comment -

          I do see the sensors being deleted, but we only observe this on Cisco devices, and SNMP is communicating fine when it happens. We do not lose any metrics in graphing, and our other SNMP-based services see no interruption. Other devices in the same location also continue to operate fine during this time.

          adama Adam Armstrong added a comment -

          Manually set thresholds shouldn't be overwritten by discovery. The most likely explanation is that the sensors are being removed and re-added. Can you see this happening in the device's eventlog? Removal deletes their entries from the database, which removes the custom thresholds too.

          We have some plans for more flexible thresholding, which might make this less of an issue.

          The ideal fix, though, is to stop whatever is making communication between the Observium server and the device flaky.
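
The remove-and-re-add mechanism described above can be illustrated with a toy model. This is not Observium's discovery code; the table layout, the `limit_high` field name, and the sensor index are hypothetical, chosen only to show why a hand-set threshold disappears when a single discovery pass comes back empty.

```python
DEFAULT_LIMIT_HIGH = None  # hypothetical stand-in for "no custom threshold"


def run_discovery(db, discovered_indexes):
    """Sync a toy sensor table with what one discovery pass returned.

    db maps sensor index -> {"limit_high": ...}. Sensors missing from the
    pass are deleted; sensors not yet in db are added with defaults.
    """
    for index in list(db):
        if index not in discovered_indexes:
            del db[index]  # the custom threshold is deleted with the row
    for index in discovered_indexes:
        db.setdefault(index, {"limit_high": DEFAULT_LIMIT_HIGH})
    return db


sensors = {"1006": {"limit_high": 60}}  # operator set 60 by hand
run_discovery(sensors, {"1006"})        # healthy pass: row untouched
after_good = dict(sensors)
run_discovery(sensors, set())           # device returned nothing: row deleted
run_discovery(sensors, {"1006"})        # next pass: sensor re-added with defaults
print(after_good, sensors)
```

In this model the sensor keeps its index, but its row is new, so the manually entered limit is gone after the empty pass.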
          awesomazing Kyle Kot added a comment -

          Is there no way to set a static value? Once this occurs, we manually set the value and it is fine for a while. If a value was set manually, can we flag it so that the poller will not update it?

          adama Adam Armstrong added a comment -

          This is almost certainly related either to connectivity or to the device's firmware being "unreliable" when responding to SNMP.
          It seems that sometimes, during discovery, the device doesn't return the expected data, so all of the entities are removed. There's not much of a solution to this; it's not really something we can fix.

          ciro Ciro Martinez added a comment -

          This morning another device lost the sensors:

          2h 32m 14s Status deleted: ups-mib-output-state upsOutputSource.0 Source of Output Power
          2h 32m 14s Sensor deleted: frequency ups-mib upsInputEntry.1 Input
          2h 32m 14s Sensor deleted: current ups-mib upsOutputEntry.1 Output
          2h 32m 14s Sensor deleted: voltage ups-mib upsInputEntry.1 Input
          2h 33m 38s xxxx Ip-addresses: 2 deleted.
          2h 33m 38s Lo IP address removed: xxxx
          2h 33m 38s eth0 IP address removed: 127.0.0.1/32

          It happens randomly. I have not found an obvious cause. @Adam: Is there any debugging information you need?

          ciro Ciro Martinez added a comment -

          I've been experiencing this issue with Tripp Lite PDUs. The current sensor gets removed and re-added every few cycles, or just gets removed and lost.

          5h 24m 26s Status deleted: ups-mib-output-state upsOutputSource.0 Source of Output Power
          5h 24m 26s Sensor deleted: current ups-mib upsOutputEntry.1 Output

          1d 11h 18m Status added: ups-mib-output-state upsOutputSource.0 Source of Output Power
          1d 11h 18m Sensor added: current ups-mib upsOutputEntry.1 Output
          1d 11h 18m Input Sensor added: frequency ups-mib upsInputEntry.1 Input
          1d 11h 18m Input Sensor added: voltage ups-mib upsInputEntry.1 Input

          1d 17h 23m Status deleted: ups-mib-output-state upsOutputSource.0 Source of Output Power
          1d 17h 23m Sensor deleted: frequency ups-mib upsInputEntry.1 Input
          1d 17h 23m Sensor deleted: current ups-mib upsOutputEntry.1 Output
          1d 17h 23m Sensor deleted: voltage ups-mib upsInputEntry.1 Input
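
For anyone collecting evidence of the same pattern, eventlog lines pasted in the format shown above can be tallied per sensor. The following is a rough sketch, not an Observium tool: the regular expression is inferred only from the lines quoted in this ticket, and the `summarize` and `flapping` helpers are hypothetical names.

```python
import re
from collections import defaultdict

# Matches lines such as
# "5h 24m 26s Sensor deleted: current ups-mib upsOutputEntry.1 Output"
EVENT_RE = re.compile(
    r"^(?P<age>(?:\d+[dhms]\s+)+)"   # relative age, e.g. "1d 11h 18m "
    r"(?:Input\s+)?"                 # optional prefix seen on some lines
    r"(?P<kind>Sensor|Status)\s+"
    r"(?P<action>added|deleted):\s+"
    r"(?P<entity>.+)$"
)

LOG = """
5h 24m 26s Status deleted: ups-mib-output-state upsOutputSource.0 Source of Output Power
5h 24m 26s Sensor deleted: current ups-mib upsOutputEntry.1 Output
1d 11h 18m Status added: ups-mib-output-state upsOutputSource.0 Source of Output Power
1d 11h 18m Sensor added: current ups-mib upsOutputEntry.1 Output
1d 11h 18m Input Sensor added: frequency ups-mib upsInputEntry.1 Input
1d 11h 18m Input Sensor added: voltage ups-mib upsInputEntry.1 Input
1d 17h 23m Status deleted: ups-mib-output-state upsOutputSource.0 Source of Output Power
1d 17h 23m Sensor deleted: frequency ups-mib upsInputEntry.1 Input
1d 17h 23m Sensor deleted: current ups-mib upsOutputEntry.1 Output
1d 17h 23m Sensor deleted: voltage ups-mib upsInputEntry.1 Input
"""


def summarize(lines):
    """Count added/deleted events per entity from pasted eventlog lines."""
    counts = defaultdict(lambda: {"added": 0, "deleted": 0})
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match:  # lines in other formats (e.g. IP address events) are skipped
            counts[match["entity"]][match["action"]] += 1
    return dict(counts)


def flapping(summary):
    """Return entities that were both added and deleted in the sample."""
    return sorted(e for e, c in summary.items() if c["added"] and c["deleted"])


if __name__ == "__main__":
    for entity in flapping(summarize(LOG.splitlines())):
        print(entity)
```

Entities that show up with both an add and a delete in the same window are the ones being churned by discovery, which is useful to attach when asking for debugging help.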

          People

            adama Adam Armstrong
            awesomazing Kyle Kot
            Votes: 1
            Watchers: 3
