This change introduces code that compares $current with $state, which in practice hold the same value. As a result, a check on a discrete sensor always reports it as healthy:
ipmitool output compared to the Supermicro sensor readings from the IPMI frontend:

Server 1
Chassis Intru | 0x1 | discrete | 0x0100 | na | na | na | na | na | na -> not healthy; has not been reported as failing in Observium since r10395, although it should be
PS1 Status | 0x1 | discrete | 0x0100 | na | na | na | na | na | na -> healthy
PS2 Status | 0x1 | discrete | 0x0100 | na | na | na | na | na | na -> healthy

I did some testing on server 2:

Server 2 - normal
VBAT | 0x4 | discrete | 0x04ff | na | na | na | na | na | na -> healthy
Chassis Intru | 0x0 | discrete | 0x0000 | na | na | na | na | na | na -> healthy
PS1 Status | 0x1 | discrete | 0x0100 | na | na | na | na | na | na -> healthy
PS2 Status | 0x1 | discrete | 0x0100 | na | na | na | na | na | na -> healthy; the power supply is inserted and the power cable is connected, reports as healthy in Observium

Server 2 - removed PS2
VBAT | 0x4 | discrete | 0x04ff | na | na | na | na | na | na -> healthy
Chassis Intru | 0x0 | discrete | 0x0000 | na | na | na | na | na | na -> healthy
PS1 Status | 0x1 | discrete | 0x0100 | na | na | na | na | na | na -> healthy
PS2 Status | 0x0 | discrete | 0x0000 | na | na | na | na | na | na -> I would like to mark this as a failure, but that depends on your own setup; reports as healthy in Observium

Server 2 - removed power cord PS2
VBAT | 0x4 | discrete | 0x04ff | na | na | na | na | na | na -> healthy
Chassis Intru | 0x0 | discrete | 0x0000 | na | na | na | na | na | na -> healthy
PS1 Status | 0x1 | discrete | 0x0100 | na | na | na | na | na | na -> healthy
PS2 Status | 0xb | discrete | 0x0b00 | na | na | na | na | na | na -> not healthy, a power supply should have power; reports as healthy in Observium since r10395

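For anyone who wants to reproduce the comparison, here is a minimal sketch of how the rows above could be split into fields. It assumes the pipe-delimited layout shown in this report (name, reading, type, state, then thresholds); `parse_sensor_line` is a hypothetical helper and is not part of Observium or ipmitool.

```python
def parse_sensor_line(line):
    """Split one pipe-delimited sensor row into name, reading, type and state."""
    # Drop the trailing "-> comment" annotation that was added in this report.
    row = line.split("->", 1)[0]
    fields = [field.strip() for field in row.split("|")]
    name, reading, sensor_type, state = fields[:4]
    return {
        "name": name,
        # Discrete readings are hex strings such as "0xb"; keep others as text.
        "reading": int(reading, 16) if sensor_type == "discrete" else reading,
        "type": sensor_type,
        "state": state,
    }


row = parse_sensor_line(
    "PS2 Status | 0xb | discrete | 0x0b00| na | na | na | na | na | na -> not healthy"
)
print(row["name"], hex(row["reading"]), row["state"])
```

The thresholds after the fourth column are ignored here because they are all "na" for the discrete sensors in this report.
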
It seems the state has to be compared against hard-coded values that depend on the sensor name.
Observium now reports my chassis intrusion sensor as healthy while it is not, and will probably do the same for a failing power supply. This is a big issue for us, because we need to be able to check whether a power supply is failing on our servers.
From what I can find, ipmitool reports the following states for discrete sensors:
Power supply:
0x0 = power supply unit not present
0x1 = status OK
0x3 = power supply off or failed
0xB = input out of range (e.g. no AC input)

Chassis Intrusion:
0x0 = no intrusion
0x1 = intrusion

VBAT:
0x4 = healthy

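As an illustration of what comparing against hard-coded values per sensor name could look like, here is a rough sketch. The table only contains the sensor names and readings observed on the Supermicro servers in this report; `HEALTHY_READINGS`, `is_healthy` and `self_comparison` are hypothetical names, not Observium code, and other platforms may well use different values.

```python
# Readings treated as healthy, keyed by sensor name (values from this report only).
HEALTHY_READINGS = {
    "PS1 Status": {0x1},
    "PS2 Status": {0x1},
    "Chassis Intru": {0x0},
    "VBAT": {0x4},
}


def is_healthy(sensor_name, reading):
    """Return True or False for a known sensor, None when nothing is known about it."""
    expected = HEALTHY_READINGS.get(sensor_name)
    if expected is None:
        return None  # unknown sensor: do not guess
    return reading in expected


def self_comparison(reading):
    """Mimic the reported problem: both sides hold the same value, so it always passes."""
    current = state = reading
    return current == state


print(is_healthy("PS2 Status", 0xB))     # False: power cord removed
print(is_healthy("Chassis Intru", 0x1))  # False: intrusion
print(self_comparison(0xB))              # True, whatever the reading is
```

Returning None for unknown sensors keeps the check from reporting a status it has no basis for.
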
Can this issue be reopened?
Edit: I made a patch which detects failing sensors correctly again in our setup: ipmi_patch_rework.diff
As I have found in many cases, it is much harder to determine the correct status for discrete sensors.
The reading is a bitwise status, so a simple comparison cannot map it to a status correctly.
Your patch will work only for your device(s) and will not be correct for many others.
For example, on my Supermicro platform this patch is incorrect!
For now I do not want to improve these discrete statuses at all, until there is a unified method for the various platforms.
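To illustrate the point about bitwise status, here is a small sketch of the difference between an equality check and a bit test. It only does the arithmetic on a value taken from this thread (0x0b); what each bit means depends on the sensor type and platform and is not asserted here. `asserted_bits` is a hypothetical helper.

```python
def asserted_bits(value):
    """Return the positions of the bits that are set in value, lowest first."""
    return [bit for bit in range(value.bit_length()) if value & (1 << bit)]


reading = 0x0B  # binary 1011

print(asserted_bits(reading))  # [0, 1, 3]
print(reading == 0x1)          # False: equality treats 0x0b as a different state
print(bool(reading & 0x1))     # True: bit 0 is still set inside 0x0b
```

A reading can therefore carry several conditions at once, which is why a single expected value per sensor may not generalize across platforms.
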