  1. Observium
  2. OBS-4682

iDRAC interfaces appearing/disappearing, filling disk/database

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Professional Edition
    • Discovery

    Description

      We have worked around this issue by filtering out the interfaces qs0, qs-sta, wwan0 and wwan1 in the global discovery settings, but thought it warranted a mention. We have filed it as minor accordingly.

      When monitoring iDRAC9 devices running firmware version 7.00.30.00, the four interface names above would appear and disappear, each time adding a row to the ports database table and a new RRD file.

      The "normal" number of NICs for one of these iDRACs is 14. By the time the server ran out of disk space and ground to a halt, one of them had >120,000 interfaces.

      (blanked the hostnames for anonymity here)

      MariaDB [observium]> select d.hostname, count(p.port_id) as port_count from ports p join devices d using (device_id) where d.hostname like 'xxxxxxxxx' group by d.device_id order by port_count desc limit 5;
      +-----------+------------+
      | hostname  | port_count |
      +-----------+------------+
      | xxxxxxxxx |     123367 |
      | xxxxxxxxx |      78291 |
      | xxxxxxxxx |      71461 |
      | xxxxxxxxx |      68093 |
      | xxxxxxxxx |      52417 |
      +-----------+------------+
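
To confirm which interface names account for the growth, a per-name count on a single device can help. The following is a minimal sketch, not a tested Observium query: it uses an in-memory SQLite database as a toy stand-in for the ports table, and it assumes that table has an ifName column (only port_id and device_id appear in this report). All rows are invented for illustration.

```python
import sqlite3

# Toy stand-in for the Observium `ports` table. Only `port_id` and `device_id`
# appear in the report; the `ifName` column is an assumption for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ports ("
    "port_id INTEGER PRIMARY KEY, device_id INTEGER, ifName TEXT)"
)

# Invented sample rows: two stable NICs plus repeated rows for flapping names.
rows = [(1, "eth0"), (1, "eth1")]
rows += [(1, "qs0")] * 4 + [(1, "wwan0")] * 3
conn.executemany("INSERT INTO ports (device_id, ifName) VALUES (?, ?)", rows)


def duplicated_names(conn, device_id):
    """Return (ifName, row_count) pairs that occur more than once on a device."""
    cur = conn.execute(
        "SELECT ifName, COUNT(*) FROM ports "
        "WHERE device_id = ? "
        "GROUP BY ifName HAVING COUNT(*) > 1 "
        "ORDER BY COUNT(*) DESC",
        (device_id,),
    )
    return cur.fetchall()


duplicates = duplicated_names(conn, 1)
print(duplicates)
```

On a real installation the same GROUP BY / HAVING pattern would need adapting to the actual schema, and it should only surface the flapping names if each reappearance really does create a new row, as described above.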
      

      Aside from the disk space usage, this was picked up because the ~600,000 additional ports were causing the discovery cron job to take >10 hours to run instead of the normal 60 minutes (~3700 devices): group re-generation was taking more than a minute per device, compared to the usual 5-10 seconds. Additionally, mysqld was seen consuming up to 2500% CPU. Running SELECT info FROM information_schema.PROCESSLIST showed multiple slow queries running a JOIN against the ports table.

      I've attached an snmpwalk for the device that wound up with the 120,000 port count, and a sample of select timestamp,message from eventlog where device_id = for this device, in case there's anything of interest in either.

      As noted we have a remediation in place, but thought it would be worth mentioning.

      One change that could alleviate CPU usage in these situations, although it might well hide underlying issues that need to be addressed, would be to make the observium-wrapper discovery re-discovery task wait until all devices have been discovered before re-generating groups, rather than doing it after every device. I notice it is possible to refresh only the groups via discovery.php -a, but it is seemingly not possible to run a discovery without the group refresh.
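
To illustrate the suggestion, here is a minimal sketch of the two orderings. It is not Observium code: discover_device and GroupRegenerator are hypothetical stand-ins, and the sketch only counts how often group regeneration would run under each approach.

```python
def discover_device(device):
    """Hypothetical stand-in for discovering a single device."""
    return f"discovered {device}"


class GroupRegenerator:
    """Hypothetical stand-in that counts how often groups are rebuilt."""

    def __init__(self):
        self.runs = 0

    def regenerate(self):
        self.runs += 1


def discover_per_device(devices, groups):
    """Behaviour as described in this report: regenerate after every device."""
    for device in devices:
        discover_device(device)
        groups.regenerate()


def discover_then_regenerate(devices, groups):
    """Suggested behaviour: discover everything, then regenerate once."""
    for device in devices:
        discover_device(device)
    groups.regenerate()


devices = [f"device-{i}" for i in range(3700)]  # roughly the fleet size above

per_device = GroupRegenerator()
discover_per_device(devices, per_device)

deferred = GroupRegenerator()
discover_then_regenerate(devices, deferred)

print(per_device.runs, deferred.runs)
```

With ~3700 devices the first approach regenerates groups 3700 times and the second once. Whether a single deferred regeneration is acceptable depends on whether anything relies on groups being current part-way through a discovery run, which this sketch does not model.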

      Happy to provide more logs if required.

      Attachments

        1. idrac.log
          523 kB
        2. idrac.snmpwalk
          594 kB

          People

            landy Mike Stupalov
            tkear Thomas Kear