Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Professional Edition
    • Discovery

    Description

      We have 4x Cisco ASR1002 routers with 500+ BGP neighbors on each. Discovery of those devices takes well over an hour, and sometimes as long as 2 hours. Debug output shows that discovery hangs in the bgp-peers module. The hang starts after the array holding all the neighbors received from the ASR1002 has been created. The module then runs its SQL SELECT statements, each of which includes a massive list of device_ids, and that makes them crawl.

      I have attached the discovery output of a single ASR1002 that has this issue. I stopped the discovery in the section that slows to a crawl. You should be able to see a few of the SQL statements with the massive list of device_ids. Each group of those takes 10-20 seconds to complete, then it runs the next one, and the next. This module alone takes up to an hour to complete on a single router. The issue is causing all our discovery jobs to back up and slow down the server. Nothing ends up completing and I am forced to kill the jobs manually. Right now I am running discovery jobs for new devices by hand to avoid those 4 problem devices.
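The query pattern described above can be illustrated with a minimal Python sketch. This is not Observium's code or schema: the `ip_addresses` table, the function names, and the sample addresses are all hypothetical, and SQLite stands in for the real database. It contrasts one query per peer, each carrying the full device_id list, with a single query whose result is reused for every peer.

```python
import sqlite3

def build_db(num_devices):
    """Create a hypothetical address table with one IP per device."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ip_addresses (device_id INTEGER, ip TEXT)")
    conn.executemany(
        "INSERT INTO ip_addresses VALUES (?, ?)",
        [(d, f"10.{d // 256}.{d % 256}.1") for d in range(1, num_devices + 1)],
    )
    return conn

def lookup_per_peer(conn, peer_ips, device_ids):
    """One query per peer; every statement embeds the whole device_id list."""
    id_list = ",".join(str(int(d)) for d in device_ids)  # integers only
    sql = ("SELECT device_id FROM ip_addresses "
           f"WHERE ip = ? AND device_id IN ({id_list})")
    result = {}
    for ip in peer_ips:
        row = conn.execute(sql, (ip,)).fetchone()
        result[ip] = row[0] if row else None
    return result

def lookup_batched(conn, peer_ips):
    """One query in total; each peer is then a dictionary lookup."""
    # Assumes each IP belongs to a single device.
    ip_to_device = dict(conn.execute("SELECT ip, device_id FROM ip_addresses"))
    return {ip: ip_to_device.get(ip) for ip in peer_ips}

conn = build_db(300)
device_ids = list(range(1, 301))
peer_ips = ["10.0.5.1", "10.1.2.1", "192.0.2.1"]  # the last one is unknown

per_peer = lookup_per_peer(conn, peer_ips, device_ids)
batched = lookup_batched(conn, peer_ips)
print(per_peer == batched)
```

In the first variant the amount of SQL sent grows with both the number of peers and the number of devices, which is consistent with the slowdown only appearing on an instance with many devices. Whether a batched lookup fits the real bgp-peers module is an open question for the developers.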

      We have a few Cisco 7200s and ASR9ks with almost as many BGP sessions on them, but they complete full discoveries in under 5 minutes. Not sure why these ASR1002s are so much slower.

      We recently upgraded to the newest code, MariaDB 10.4.16, and PHP 7.0.33.

        Activity

          [OBS-3523] Discovery of device takes over 1 hour

          landy Mike Stupalov added a comment:

          Yes, exactly, the overall device count is the problem here.

          I'm sure discovery time will now be reduced, but I want to know by how much.
          If this is not enough, then I will have to exclude the search for neighbour peers by IP.

          ajackson Andy Jackson added a comment:

          I tested the same problem device in our production Observium and in a test Observium. Both instances run the same version. The only difference is that production has 4k+ devices and the test instance has 20 devices. When production runs a discovery of the problem device, it takes over an hour. When the test instance runs a discovery of the same device, it takes 8 minutes.

          I am still working on getting approval to upgrade production Observium to test your changes.
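As a rough back-of-envelope check of these figures, the following Python sketch treats the 8-minute run as a fixed baseline and "over an hour" as a lower bound of 60 minutes. Both are simplifying assumptions, as is reusing the 10-20 seconds per query group quoted in the description; the variable names are illustrative only.

```python
# Figures taken from the report; treating them as exact is an assumption.
baseline_s = 8 * 60       # test instance with 20 devices
production_s = 60 * 60    # production lower bound ("over an hour")

# Time that appears to depend on the number of devices in the database.
extra_s = production_s - baseline_s

# Number of slow query groups that would account for the extra time,
# at 20 s and at 10 s per group respectively.
groups_at_20s = extra_s // 20
groups_at_10s = extra_s // 10

print(extra_s / 60, groups_at_20s, groups_at_10s)
```

Under these assumptions, at least 52 minutes of the production run are attributable to work that scales with the device count, which corresponds to somewhere between roughly 150 and 310 slow query groups.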

          landy Mike Stupalov added a comment:

          This needs to be checked with a large number of peers.
          Take your time, I'll wait until you can check on the main system.

          ajackson Andy Jackson added a comment:

          I won't be able to test this in our production environment quickly.

          As a test, however, I spun up a new Observium VM and added just this one device to it. A full discovery of the device took under 8 minutes every time I ran it. The only difference I could find is that the new instance doesn't have all the device_ids listed in the SQL queries the way production does.

          landy Mike Stupalov added a comment:

          Yes, this is the association with the remote device.

          I have sped up this association a little; try r10845 (note: you need to switch to rolling updates).

          I am not sure I can speed it up any further.
          I am waiting for your feedback on whether that is enough.

          People

            landy Mike Stupalov
            ajackson Andy Jackson
            Votes: 0
            Watchers: 3