[OBS-4682] iDRAC interfaces appearing/disappearing, filling disk/database - Observium

Details

Type: Bug
Resolution: Cannot Reproduce
Priority: Minor
Fix Version/s: None
Affects Version/s: Professional Edition
Component/s: Discovery
Labels:
- idrac

Description

We have worked around this issue by filtering out the interfaces qs0, qs-sta, wwan0 and wwan1 in the global discovery settings, but thought it warranted a mention, and have filed it as minor accordingly.
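As a rough illustration of the workaround, the name filter can be thought of as a check applied to each discovered interface before it is stored as a port. The sketch below is Python, not Observium's code: the function names, the exact-match (case-insensitive) behaviour and the sample interface names eth0 and lo are assumptions made for the example. Only the four filtered names come from this report.

```python
import fnmatch

# Names filtered in this report; the list variable itself is hypothetical.
IGNORED_IF_NAMES = ["qs0", "qs-sta", "wwan0", "wwan1"]


def is_ignored(if_name, patterns=IGNORED_IF_NAMES):
    """Return True when the interface name matches an ignore pattern.

    Matching is case-insensitive; this is an assumption for the sketch,
    Observium's own matching rules may differ.
    """
    name = if_name.lower()
    return any(fnmatch.fnmatchcase(name, p.lower()) for p in patterns)


def filter_ports(if_names):
    """Keep only the interfaces that should be stored as ports."""
    return [n for n in if_names if not is_ignored(n)]


# eth0 and lo are made-up examples of interfaces that should be kept.
discovered = ["eth0", "qs0", "qs-sta", "wwan0", "wwan1", "lo"]
kept = filter_ports(discovered)
print(kept)
```

With a filter like this in place, a flapping name never reaches the ports table, so no new row or RRD file is created for it.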

On iDRAC9 devices running firmware version 7.00.30.00, the four interface names above would repeatedly appear and disappear, each reappearance adding a row to the ports database table and a new RRD file.

The "normal" number of NICs for one of these iDRACs is 14. By the time the server ran out of disk space and ground to a halt, one of them had more than 120,000 interfaces.

(Hostnames are blanked here for anonymity.)

MariaDB [observium]> select d.hostname, count(p.port_id) as port_count from ports p join devices d using (device_id) where d.hostname like 'xxxxxxxxx' group by d.device_id order by port_count desc limit 5;
+-----------+------------+
| hostname  | port_count |
+-----------+------------+
| xxxxxxxxx |     123367 |
| xxxxxxxxx |      78291 |
| xxxxxxxxx |      71461 |
| xxxxxxxxx |      68093 |
| xxxxxxxxx |      52417 |
+-----------+------------+
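A check along the following lines could surface runaway port counts before the disk fills. It is only a sketch in Python: the rows are copied from the output above, the baseline of 14 is the "normal" NIC count quoted earlier, and the 10x alert factor and the function name are hypothetical choices rather than anything Observium provides.

```python
# (hostname, port_count) rows copied from the query output in this report.
rows = [
    ("xxxxxxxxx", 123367),
    ("xxxxxxxxx", 78291),
    ("xxxxxxxxx", 71461),
    ("xxxxxxxxx", 68093),
    ("xxxxxxxxx", 52417),
]

EXPECTED_PORTS = 14  # "normal" NIC count for one of these iDRACs
ALERT_FACTOR = 10    # hypothetical: flag anything 10x above the baseline


def flag_runaway(rows, expected=EXPECTED_PORTS, factor=ALERT_FACTOR):
    """Return the rows whose port count exceeds expected * factor."""
    limit = expected * factor
    return [(host, count) for host, count in rows if count > limit]


flagged = flag_runaway(rows)
total_ports = sum(count for _, count in flagged)
print(len(flagged), total_ports)
```

For the five devices listed, all are flagged, and together they hold 393,629 port rows.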

Aside from the disk space usage, this was picked up because the ~600,000 additional ports were causing the discovery cron job to take more than 10 hours to run instead of the normal 60 minutes (~3700 devices): group re-generation was taking more than a minute per device, compared to the usual 5-10 seconds. Additionally, mysqld was seen consuming up to 2500% CPU. Running SELECT info FROM information_schema.PROCESSLIST showed multiple slow queries running a JOIN against the ports table.

I've attached an snmpwalk for the device that wound up with the 120,000 port count, and a sample of select timestamp,message from eventlog where device_id = for this device, in case there's anything of interest in either.

As noted we have a remediation in place, but thought it would be worth mentioning.

One change that could alleviate CPU usage in these situations, although it might well hide underlying issues that need to be addressed, would be to make the observium-wrapper discovery re-discovery task wait until all devices have been discovered before re-generating groups, rather than doing so after every device. I notice it is possible to refresh only the groups via discovery.php -a, but it is seemingly not possible to run a discovery without the group refresh.
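The suggested change can be sketched as follows. This is an illustrative Python model only, not Observium's implementation: discover() and regenerate_groups() are hypothetical stand-ins, and the device names are made up. It simply counts how many group regenerations each strategy triggers for ~3700 devices.

```python
def discover(device):
    """Hypothetical stand-in for discovering a single device."""
    return f"discovered {device}"


def regenerate_groups(calls):
    """Hypothetical stand-in for the group refresh; records each call."""
    calls.append("regen")


def run_per_device(devices):
    """Behaviour described in this report: refresh groups after every device."""
    calls = []
    for device in devices:
        discover(device)
        regenerate_groups(calls)
    return len(calls)


def run_deferred(devices):
    """Suggested behaviour: discover everything, then refresh groups once."""
    calls = []
    for device in devices:
        discover(device)
    if devices:
        regenerate_groups(calls)
    return len(calls)


devices = [f"device-{i}" for i in range(3700)]
print(run_per_device(devices), run_deferred(devices))
```

In this model the per-device strategy performs 3700 regenerations and the deferred one performs a single regeneration. Taken at face value, at more than a minute per regeneration that is over 3700 minutes of cumulative group work in the first case, although the wall-clock effect depends on how the wrapper parallelises discovery.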

Happy to provide more logs if required.

Attachments

idrac.log
523 kB
2023/11/20 09:03 PM

People

Assignee: Mike Stupalov

Reporter: Thomas Kear

Votes: 0

Watchers: 3

Dates

Created: 2023/11/20 09:14 PM

Updated: 2024/12/13 09:50 AM

Resolved: 2024/12/13 09:50 AM