IPMI chassis and power supply support

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Professional Edition
    • Unix Agent

    Description

      The function parse_ipmitool_sensor() does not work correctly because of the empty 'discrete' value in includes/definitions/entities/sensors.inc.php.

      The following statement in parse_ipmitool_sensor() (includes/entities/sensor.inc.php) evaluates to true when $unit is 'discrete':

      if (isset($config['ipmi_unit'][$unit])) {
      

      and the code below (from OBS-3028) will never run:

      } elseif ($unit == 'discrete') {
      

      because an empty value is set in the array, so isset() returns true.

      The attached patch removes the empty value from the array. Another option would be to check whether the unit array actually holds a value:

      if (isset($config['ipmi_unit'][$unit]) && $config['ipmi_unit'][$unit]) {
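
      The difference between the two checks can be reproduced in isolation. The following is only a minimal sketch: the array contents and the classify_unit() helper are hypothetical stand-ins, not the real Observium definitions or the attached patch.

```php
<?php
// Hypothetical stand-in for the real definitions array; only the empty
// 'discrete' entry matters for this illustration.
$config = ['ipmi_unit' => ['Volts' => 'voltage', 'discrete' => '']];

// Hypothetical helper that mirrors the if/elseif structure quoted above.
function classify_unit(array $config, string $unit, bool $require_value): string
{
    // isset() is true for an empty string; it is only false for null/unset.
    $known = isset($config['ipmi_unit'][$unit]);
    if ($require_value) {
        // Alternative check: the entry must also hold a non-empty value.
        $known = $known && $config['ipmi_unit'][$unit];
    }

    if ($known) {
        return 'unit-branch';
    } elseif ($unit == 'discrete') {
        return 'discrete-branch';
    }
    return 'skipped';
}

echo classify_unit($config, 'discrete', false), "\n"; // unit-branch
echo classify_unit($config, 'discrete', true), "\n";  // discrete-branch
```

      With the plain isset() check the 'discrete' branch is unreachable; with the extra truthiness test it is reached again.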
      

      Attachments

        1. ipmi_discrete.diff
          0.5 kB
        2. ipmi_patch_bernard.diff
          4 kB
        3. ipmi_patch_rework.diff
          4 kB
        4. ipmi_sensors.diff
          3 kB
        5. ipmi_sensors.diff
          3 kB
        6. ipmitool-sdr-r730.txt
          69 kB
        7. ipmitool-sensors-r730.txt
          21 kB
        8. poller_patched.log
          115 kB
        9. poller.log
          111 kB
        10. poller-r730.txt
          329 kB

          Activity

            [OBS-3187] IPMI chassis and power supply support

            For now I'll submit my last patch regarding this subject. It takes the $state value instead of the $current value which should work for both of our cases for now.

            I agree the discrete sensor implementation is highly unstable and vendor specific so a more generic solution would be a good thing to have.

            The "works for me hope it also works for you" patch: ipmi_patch_bernard.diff

            veldenb Bernard van der Velden added a comment

            I see, the "ipmitool sdr" command might generate some more sensible output (it has an "ok" column) for discrete sensors instead of "ipmitool sensor".

            Output from the same server:

            ipmitool sensor
             
            Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na        
            PS1 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na        
            PS2 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na        

            ipmitool sdr
             
            Chassis Intru    | 0x00              | ok
            PS1 Status       | 0x01              | ok
            PS2 Status       | 0x01              | ok
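
            Reading the "ok" column from that output could look roughly like the sketch below. It assumes the three-column, pipe-separated layout shown above; parse_sdr_line() is a hypothetical helper and not part of Observium.

```php
<?php
// Split one line of "ipmitool sdr" output into name, reading and status.
// Assumes exactly three pipe-separated columns, as in the sample above.
function parse_sdr_line(string $line): ?array
{
    $fields = array_map('trim', explode('|', $line));
    if (count($fields) !== 3) {
        return null; // not a sensor row in the expected layout
    }
    [$name, $reading, $status] = $fields;
    return ['name' => $name, 'reading' => $reading, 'status' => $status];
}

$row = parse_sdr_line('PS1 Status       | 0x01              | ok');
echo $row['name'], ' => ', $row['status'], "\n"; // PS1 Status => ok
```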

            veldenb Bernard van der Velden added a comment

            For example, this is my Intrusion sensor, which really is in the Alert state:

             Intrusion        | 0x0        | discrete   | 0x0100| na        | na        | na        | na        | na        | na

            while this one, on your platform, is in the Ok state as far as I can tell:

            Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na

            landy Mike Stupalov added a comment

            As I have found in many cases, it is much harder to detect the correct status of discrete sensors.
            The value is a bitwise status, so it cannot be mapped to a status correctly with a simple comparison.
            Your patch will work only for your device(s) and is not correct for many others.
            For example, it is incorrect for my Supermicro platform.

            For now I do not want to work on these discrete statuses any further until there is a unified method that works across platforms.
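
            To illustrate why a simple comparison is not enough for a bitwise status, here is a small sketch. It assumes that the first byte of the fourth "ipmitool sensor" column is a bit mask of asserted states; what each bit means depends on the sensor type and vendor and is deliberately not decoded here. asserted_bits() is a hypothetical helper.

```php
<?php
// Return the bit positions set in the first state byte of the fourth
// "ipmitool sensor" column, e.g. '0x0b00' -> [0, 1, 3].
// Assumption: that first byte is a bit mask of asserted states.
function asserted_bits(string $hex): array
{
    // intval() with base 16 accepts the leading "0x" prefix.
    $mask = (intval($hex, 16) >> 8) & 0xff;
    $bits = [];
    for ($bit = 0; $bit < 8; $bit++) {
        if ($mask & (1 << $bit)) {
            $bits[] = $bit;
        }
    }
    return $bits;
}

print_r(asserted_bits('0x0100')); // bit 0 only
print_r(asserted_bits('0x0b00')); // bits 0, 1 and 3
```

            Under that assumption 0x0b00 still has bit 0 set, just like 0x0100, so an equality test against a single "ok" value and a test of individual bits can disagree.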

            landy Mike Stupalov added a comment

            This change introduces code that compares the values $current and $state, which in practice are the same value. As a result, a discrete sensor check will always report as healthy:

             
            ipmitool output compared to Supermicro sensor readings from ipmi frontend:
             
            Server 1
            Chassis Intru    | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na        -> not healthy, not failing in Observium since r10395 while it should
            PS1 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na        -> healthy
            PS2 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na        -> healthy
            
            

            I did some testing on server 2:

             
            Server 2 - normal
            VBAT             | 0x4        | discrete   | 0x04ff| na        | na        | na        | na        | na        | na        -> healthy
            Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na        -> healthy
            PS1 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na        -> healthy
            PS2 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na        -> healthy, power supply is inserted and power cable is connected, reports as healthy in Observium
             
            Server 2 - removed PS2
            VBAT             | 0x4        | discrete   | 0x04ff| na        | na        | na        | na        | na        | na        -> healthy
            Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na        -> healthy
            PS1 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na        -> healthy
            PS2 Status       | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na        -> I would like to mark this as a failure but this depends on your own setup, reports as healthy in Observium
             
            Server 2 - removed power cord PS2
            VBAT             | 0x4        | discrete   | 0x04ff| na        | na        | na        | na        | na        | na        -> healthy
            Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na        -> healthy
            PS1 Status       | 0x1        | discrete   | 0x0100| na        | na        | na        | na        | na        | na        -> healthy
            PS2 Status       | 0xb        | discrete   | 0x0b00| na        | na        | na        | na        | na        | na        -> not healthy, a power supply should have power, reports as healthy in Observium since r10395
            
            

            It seems the state has to be compared against hard-coded values that depend on the sensor name.

            Observium now reports my Chassis Intrusion sensor as healthy while it is not, and will probably do the same for a failing power supply. This is a big issue for us because we need to be able to check whether a power supply is failing on our servers.

            From what I can find, the following states are communicated by ipmitool for discrete sensors:

            Power supply:
            0x0=power supply unit not present
            0x1=status ok
            0x3=power supply off or failed
            0xB=Input out of range (ex. No AC input)
             
            Chassis Intrusion:
            0x0=No Intrusion
            0x1=Intrusion
             
            VBAT:
            0x4=Healthy
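
            A lookup along these lines could be sketched as follows. Only the state values and their descriptions come from the list above; the class keys ('powersupply', 'intrusion', 'battery') and the discrete_status() helper are hypothetical, and this is not the attached patch. Whether a given description counts as a failure is left to the caller, since that depends on the setup.

```php
<?php
// State tables copied from the list above; the class keys are hypothetical.
const DISCRETE_STATES = [
    'powersupply' => [
        0x0 => 'power supply unit not present',
        0x1 => 'status ok',
        0x3 => 'power supply off or failed',
        0xB => 'input out of range',
    ],
    'intrusion' => [
        0x0 => 'no intrusion',
        0x1 => 'intrusion',
    ],
    'battery' => [
        0x4 => 'healthy',
    ],
];

// Map a reading such as '0xb' to its description, or 'unknown' when the
// class or state is not listed.
function discrete_status(string $class, string $reading): string
{
    $state = intval($reading, 16); // base 16 accepts the "0x" prefix
    return DISCRETE_STATES[$class][$state] ?? 'unknown';
}

echo discrete_status('powersupply', '0xb'), "\n"; // input out of range
echo discrete_status('intrusion', '0x1'), "\n";   // intrusion
```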
            

            Can this issue be reopened?

            Edit: I made a patch which detects failing sensors correctly again in our setup: ipmi_patch_rework.diff

            veldenb Bernard van der Velden added a comment - edited

            Sure. I've attached the poller output, along with the output from calling 'ipmitool sensor' and 'ipmitool -v sdr' directly. The latter file shows that (for example) the sensor named 'PS1 PG Fail' is of the 'Voltage' type, not 'Power Supply', so its assertions don't match what the existing entPhysicalClass 'powersupply' matching does.

            ipmitool-sdr-r730.txt

            andrewbonney Andrew Bonney added a comment

            IPMI sensors polled without changes.

            landy Mike Stupalov added a comment

            Thanks Mike. Would the sensor output still be useful?

            andrewbonney Andrew Bonney added a comment

            Note: as of r10396, discrete statuses are disabled by default, because support is still unstable and produces many false positives.
            To enable them, go to Global settings -> Entities and check "Enable polling IPMI discrete sensors".

            landy Mike Stupalov added a comment

            OK, this should now be fixed in r10395.

            landy Mike Stupalov added a comment

            People

              landy Mike Stupalov
              veldenb Bernard van der Velden