Couchbase Server / MB-32545

Swap index+n1ql node rebalance failed


Details

    • Untriaged
    • Unknown

    Attachments

      1. cbq-engine_go_dump_172-23-97-15.txt
        60 kB
        Mahesh Mandhare
      2. cbq-engine_go_dump_172-23-97-19.txt
        60 kB
        Mahesh Mandhare
      3. cbq-engine_go_dump_172-23-97-20.txt
        63 kB
        Mahesh Mandhare
      4. lsof_6.5.0-4059_172-23-97-20.txt
        31.33 MB
        Mahesh Mandhare
      5. lsof.txt
        48.66 MB
        Mahesh Mandhare

      Activity

        Will take a look later. Did you remove all the unwanted steps and reduce the documents in the bucket to 1000 as described in the mail?

        Did you try with 6.5.0-4578?

        Sitaram.Vemulapalli Sitaram Vemulapalli added a comment

        Sitaram Vemulapalli, I started this test yesterday, so it has all the steps.

        Will try on 6.5.0-4578 when I get the clusters again.

        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment
        Sitaram.Vemulapalli Sitaram Vemulapalli added a comment - edited

        I have looked at 172.23.97.20, which has the highest number of open files. From the info, I cannot find anything unusual.

        wc -l lsof.txt
        256690 lsof.txt

        cbq-engine uses 50% of them:
        grep cbq-engin lsof.txt | wc -l
        133792

        25% of those are established connections to memcached on port 11210 (these files are not under our control; the Go runtime uses them underneath):

        grep cbq-engin lsof.txt | grep "11210" | wc -l
        74793

        6% are on port 8091, which lsof reports by its service name, jamlink (https://www.speedguide.net/port.php?port=8091):
        grep cbq-engin lsof.txt | grep "jamlink" | wc -l
        15171

        6% show as bacula-dir (I think this is firewall-related):

        grep cbq-engin lsof.txt | grep "bacula-dir" | wc -l
        14659

        According to netstat, only 794 connections are established:

        grep 11210 neststat.txt | wc -l
        794
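        The per-destination counts above can be folded into one pass. A sketch, assuming the same lsof.txt layout as the attachment (the `fd_summary` helper name is mine, not from the ticket):

```shell
# Summarize cbq-engine open-file counts per destination pattern from an
# lsof dump: prints "pattern count percent" for each pattern given.
fd_summary() {
  file=$1; shift
  total=$(grep -c 'cbq-engin' "$file")   # all cbq-engine entries
  for pat in "$@"; do
    n=$(grep 'cbq-engin' "$file" | grep -c "$pat")
    echo "$pat $n $((100 * n / total))%"
  done
}
# Usage: fd_summary lsof.txt 11210 jamlink bacula-dir
```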

        The goroutine dump does not look right - not sure how it was collected; a lot of mandatory goroutines are missing.
        Use the following command:

        curl -u Administrator:password http://localhost:8093/debug/pprof/goroutine?debug=2

        These must be collected at the time of the failure, not later, because that gives the correct picture to correlate. The goroutine dump shows which goroutines are running.

        As mentioned, the whole Couchbase architecture is built on REST/HTTP, which uses connection pooling to make it faster, so it may need more files. Also, instead of limiting files, set the limit to unlimited and try.
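        One way to make sure a debug=2 dump is captured close to the failure is to snapshot it on a timer for the duration of the rebalance. A sketch, reusing the credentials and port from the curl command above (the helper names are mine):

```shell
# Build the full goroutine-dump URL (debug=2, as requested above) for a node.
dump_url() {
  echo "http://$1:8093/debug/pprof/goroutine?debug=2"
}

# Take one timestamped snapshot of the dump from the given query node.
snapshot() {
  curl -s -u Administrator:password "$(dump_url "$1")" \
    > "goroutine-$1-$(date +%Y%m%d-%H%M%S).txt"
}

# While the rebalance runs, e.g.:
#   while :; do snapshot localhost; sleep 10; done
```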


        Build 6.5.0-4959

        Not able to find a way to set the open files limit to unlimited; found that it can be set to a maximum of 1048576 (https://stackoverflow.com/a/1213069/2320823). Let me know if there is any way to set it unlimited.
        Tried setting the open files limit to 1048575 on the index+query nodes and ran the test; the swap rebalance failed with a similar error.
        Job: http://perf.jenkins.couchbase.com/job/arke-multi-bucket/343
        Not able to collect a goroutine dump while the swap rebalance was happening; will try next time.
        The previously collected goroutine dump was taken with http://localhost:8093/debug/pprof/goroutine?debug=1
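        For reference, on Linux there is indeed no true "unlimited" for RLIMIT_NOFILE; the effective per-process ceiling is the fs.nr_open sysctl (1048576 by default), which can itself be raised. A sketch - the values and the systemd snippet are illustrative, not taken from this run:

```shell
ulimit -n                   # current soft limit for this shell
cat /proc/sys/fs/nr_open    # kernel per-process ceiling (default 1048576)
# Raise the ceiling, then the per-process limit can go above 1048576:
sudo sysctl -w fs.nr_open=2097152
# For a systemd-managed service, set it in the unit file instead:
#   [Service]
#   LimitNOFILE=1048576
```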

        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment
        marco.greco Marco Greco added a comment -

        Mahesh Mandhare this is not something that can be fixed easily - it requires rearchitecting the layer that connects n1ql and memcached to switch from using a connection pool per node+bucket combination to having connection pools per node only.
        To make it more complicated, this layer is used by other components as well.
        Also, the GSI client and FTS would have to be rewritten in a similar fashion.

        Currently the story is: the more buckets you have, the more file descriptors you are going to use.
        I'm sure we will revisit this at some stage - for collections and TCO - but the amount of work required is substantial, so it's not something that we would be able to do for mad-hatter.
        Even cheshire-cat is not a given.
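        The scaling described above can be put into rough numbers. All three figures below (node count, bucket count, pool size) are made-up illustrative assumptions, not measurements from this cluster:

```shell
# A pool per node+bucket multiplies descriptors by the bucket count;
# a pool per node only would not.
nodes=4 buckets=30 pool=16
per_node_bucket=$((nodes * buckets * pool))
per_node_only=$((nodes * pool))
echo "pool per node+bucket: $per_node_bucket descriptors"
echo "pool per node only:   $per_node_only descriptors"
```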


        People

          jyotsna.nayak Jyotsna Nayak
          mahesh.mandhare Mahesh Mandhare (Inactive)
          Votes:
          0 Vote for this issue
          Watchers:
          11 Start watching this issue

          Dates

            Created:
            Updated:

            Gerrit Reviews

              There are no open Gerrit changes
