This change introduces code that compares $current and $state, which in practice hold the same value. As a result, the check on a discrete sensor always reports it as healthy:

ipmitool output compared with the Supermicro sensor readings from the IPMI frontend:

Server 1
Chassis Intru | 0x1 | discrete | 0x0100 | na | na | na | na | na | na -> not healthy, but not flagged as failing in Observium since r10395, although it should be
PS1 Status | 0x1 | discrete | 0x0100 | na | na | na | na | na | na -> healthy
PS2 Status | 0x1 | discrete | 0x0100 | na | na | na | na | na | na -> healthy

I did some testing on server 2:

Server 2 - normal
VBAT | 0x4 | discrete | 0x04ff | na | na | na | na | na | na -> healthy
Chassis Intru | 0x0 | discrete | 0x0000 | na | na | na | na | na | na -> healthy
PS1 Status | 0x1 | discrete | 0x0100 | na | na | na | na | na | na -> healthy
PS2 Status | 0x1 | discrete | 0x0100 | na | na | na | na | na | na -> healthy, power supply is inserted and power cable is connected, reports as healthy in Observium

Server 2 - removed PS2
VBAT | 0x4 | discrete | 0x04ff | na | na | na | na | na | na -> healthy
Chassis Intru | 0x0 | discrete | 0x0000 | na | na | na | na | na | na -> healthy
PS1 Status | 0x1 | discrete | 0x0100 | na | na | na | na | na | na -> healthy
PS2 Status | 0x0 | discrete | 0x0000 | na | na | na | na | na | na -> I would like to mark this as a failure, but that depends on your own setup; reports as healthy in Observium

Server 2 - removed power cord PS2
VBAT | 0x4 | discrete | 0x04ff | na | na | na | na | na | na -> healthy
Chassis Intru | 0x0 | discrete | 0x0000 | na | na | na | na | na | na -> healthy
PS1 Status | 0x1 | discrete | 0x0100 | na | na | na | na | na | na -> healthy
PS2 Status | 0xb | discrete | 0x0b00 | na | na | na | na | na | na -> not healthy, a power supply should have power, reports as healthy in Observium since r10395

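To make the readings above easier to reason about, here is a minimal Python sketch of how one such line could be split into fields. It is not Observium's code (Observium is written in PHP); the names `DiscreteReading` and `parse_sensor_line` are hypothetical, and the sketch assumes the column order shown in the output above (name, reading, type, raw status).

```python
from typing import NamedTuple, Optional


class DiscreteReading(NamedTuple):
    """One row of the ipmitool sensor listing (hypothetical helper type)."""
    name: str
    state: Optional[int]  # None when the reading is not a hex value, e.g. "na"
    sensor_type: str
    raw_status: str


def parse_sensor_line(line: str) -> Optional[DiscreteReading]:
    # Drop the "-> ..." annotation used in this report, if present.
    line = line.split("->", 1)[0]
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 4:
        return None  # not a sensor row
    name, reading, sensor_type, raw_status = fields[:4]
    try:
        state = int(reading, 16)  # accepts "0x1", "0xb", ...
    except ValueError:
        state = None  # "na" or a non-hex reading
    return DiscreteReading(name, state, sensor_type, raw_status)


if __name__ == "__main__":
    row = "PS2 Status | 0xb | discrete | 0x0b00| na | na | na | na | na | na"
    print(parse_sensor_line(row))
```

The parsed `state` is the value that needs to be compared against the known-good value for that sensor, rather than against a second copy of itself.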
It seems the state has to be compared against hard-coded values that depend on the sensor name.
Observium now reports my chassis intrusion sensor as healthy while it is not, and will probably do the same for a failing power supply. This is a big issue for us because we need to be able to check whether a power supply is failing on our servers.
From what I can find, the following states are communicated by ipmitool for discrete sensors:

Power supply:
0x0=power supply unit not present
0x1=status ok
0x3=power supply off or failed
0xB=Input out of range (ex. No AC input)

Chassis Intrusion:
0x0=No Intrusion
0x1=Intrusion

VBAT:
0x4=Healthy

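The state lists above could be expressed as a lookup table keyed by sensor name. The following is only a Python sketch of that idea, not the contents of either patch: the table name, the function names, and the prefix matching on "PS", "Chassis Intru" and "VBAT" are assumptions based on the sensor names in the output earlier in this report. Marking 0x0 (power supply not present) as unhealthy is also an assumption, since that depends on the setup.

```python
from typing import Dict, Optional, Tuple

# Sensor-name prefix -> {state: (description, healthy)}.
# The values come from the lists in this report; the prefixes are an assumption.
STATE_TABLE: Dict[str, Dict[int, Tuple[str, bool]]] = {
    "PS": {
        0x0: ("power supply unit not present", False),  # setup dependent
        0x1: ("status ok", True),
        0x3: ("power supply off or failed", False),
        0xB: ("input out of range (e.g. no AC input)", False),
    },
    "Chassis Intru": {
        0x0: ("no intrusion", True),
        0x1: ("intrusion", False),
    },
    "VBAT": {
        0x4: ("healthy", True),
    },
}


def lookup(sensor_name: str, state: int) -> Optional[Tuple[str, bool]]:
    """Return (description, healthy), or None if the sensor or state is unknown."""
    for prefix, states in STATE_TABLE.items():
        if sensor_name.startswith(prefix):
            return states.get(state)
    return None


def is_healthy(sensor_name: str, state: int) -> Optional[bool]:
    entry = lookup(sensor_name, state)
    return None if entry is None else entry[1]


if __name__ == "__main__":
    for name, state in [("PS2 Status", 0xB), ("Chassis Intru", 0x1), ("VBAT", 0x4)]:
        print(name, hex(state), lookup(name, state))
```

Returning `None` for unknown sensors or states keeps "no opinion" separate from "failing", which matters because discrete sensor encodings are vendor specific.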
Can this issue be reopened?
Edit: I made a patch which detects failing sensors correctly again in our setup: ipmi_patch_rework.diff
For now I'll submit my last patch on this subject. It uses the $state value instead of the $current value, which should work for both of our cases.
I agree that the discrete sensor implementation is highly unstable and vendor-specific, so a more generic solution would be a good thing to have.
The "works for me hope it also works for you" patch: ipmi_patch_bernard.diff