Couchbase Server / MB-32545

Swap index+n1ql node rebalance failed


Details

    • Untriaged
    • Unknown

    Attachments

      1. cbq-engine_go_dump_172-23-97-15.txt
        60 kB
      2. cbq-engine_go_dump_172-23-97-19.txt
        60 kB
      3. cbq-engine_go_dump_172-23-97-20.txt
        63 kB
      4. info.zip
        6.69 MB
      5. lsof_6.5.0-4059_172-23-97-20.txt
        31.33 MB
      6. lsof.txt
        48.66 MB

      Activity

        mahesh.mandhare Mahesh Mandhare (Inactive) created issue -
        raju Raju Suravarjjala made changes -
        Field Original Value New Value
        Fix Version/s Mad-Hatter [ 15037 ]

        This has nothing to do with query. You should check which log has the error and transfer the issue to that component.

        Sitaram.Vemulapalli Sitaram Vemulapalli added a comment -
        Sitaram.Vemulapalli Sitaram Vemulapalli made changes -
        Component/s query [ 10718 ]
        Sitaram.Vemulapalli Sitaram Vemulapalli made changes -
        Assignee Keshav Murthy [ keshav ] Jeelan Poola [ jeelan.poola ]
        jeelan.poola Jeelan Poola made changes -
        Assignee Jeelan Poola [ jeelan.poola ] Varun Velamuri [ varun.velamuri ]

        Keshav Murthy can you please reassign appropriately?

        shivani.gupta Shivani Gupta added a comment -
        Sitaram.Vemulapalli Sitaram Vemulapalli added a comment - - edited

        It is already assigned to the indexer team; please take a look.

        The query service is transient; we start whatever ns_server brings up.

        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

        Observed this issue again in the same test.
        Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/266
        Indexer logs-
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-266/172.23.97.15.zip
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-266/172.23.97.19.zip
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-266/172.23.97.20.zip
        shivani.gupta Shivani Gupta made changes -
        Priority Major [ 3 ] Critical [ 2 ]
        shivani.gupta Shivani Gupta added a comment - - edited

        Varun Velamuri can we have an update on this?

        We cannot certify 30 buckets until we know more about this issue and whether retrying is an acceptable workaround.

        Keshav Murthy please help with prioritizing this.


        RMS (reliability-manageability-serviceability) work is the highest priority for the GSI team.

        Jeelan Poola is aware of this and will prioritize accordingly.

        keshav Keshav Murthy added a comment -
        jliang John Liang made changes -
        Assignee Varun Velamuri [ varun.velamuri ] Deepkaran Salooja [ deepkaran.salooja ]

        The nodes need to be configured with a higher limit for open file descriptors (fds).

        couchbase user limits
        su couchbase -s /bin/sh -c "ulimit -a"
        ==============================================================================
        open files                      (-n) 1024
        

        When the limit is exhausted, the TCP error below is seen:

         dial tcp 127.0.0.1:9102: connect: cannot assign requested address
        

        deepkaran.salooja Deepkaran Salooja added a comment -
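        For reference, a minimal sketch of how the open files limit could be raised for the couchbase user. The file paths, the 40960 value, and the couchbase-server.service unit name are assumptions about this environment, not steps taken in this ticket:

        # Sketch only: raise the per-user open files limit (assumes pam_limits is in effect).
        # /etc/security/limits.d/couchbase.conf
        couchbase  soft  nofile  40960
        couchbase  hard  nofile  40960

        # On systemd-managed installs the service limit may also need a drop-in:
        mkdir -p /etc/systemd/system/couchbase-server.service.d
        printf '[Service]\nLimitNOFILE=40960\n' > /etc/systemd/system/couchbase-server.service.d/limits.conf
        systemctl daemon-reload && systemctl restart couchbase-server

        # Verify as the couchbase user:
        su couchbase -s /bin/sh -c "ulimit -n"

        Limits raised this way only apply to processes started afterwards, which is why the sketch includes a service restart.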
        deepkaran.salooja Deepkaran Salooja made changes -
        Resolution Won't Fix [ 2 ]
        Status Open [ 1 ] Resolved [ 5 ]
        deepkaran.salooja Deepkaran Salooja made changes -
        Actual End 2019-02-05 17:40 (issue has been resolved)
        wayne Wayne Siu made changes -
        Assignee Deepkaran Salooja [ deepkaran.salooja ] Mahesh Mandhare [ mahesh.mandhare ]

        Build 6.5.0-2433

        Increased the open files limit to 4096 and ran the same test. This time it failed in the rebalance-out scenario, before the swap rebalance, with the same reason.

        Rebalance exited with reason {service_rebalance_failed,index,
                                      {rebalance_failed,
                                       {service_error,
                                        <<"Post http://127.0.0.1:9102/createIndexRebalance: dial tcp 127.0.0.1:9102: connect: cannot assign requested address">>}}}
        
        

        Logs-
        rebalance out node- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-290/172.23.97.15.zip
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-290/172.23.97.19.zip
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-290/172.23.97.20.zip

        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -
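        As a diagnostic aside, "connect: cannot assign requested address" on a loopback dial can also come from ephemeral-port exhaustion rather than the fd limit. A quick check is sketched below; the sysctls and ss filters are standard Linux, but whether they are the bottleneck on these nodes is only an assumption:

        # Sketch: distinguish fd exhaustion from ephemeral-port exhaustion on the failing node.
        sysctl net.ipv4.ip_local_port_range          # usable local port range
        ss -tan | grep -c '127.0.0.1:9102'           # sockets already talking to the indexer port
        ss -tan state time-wait | wc -l              # TIME_WAIT sockets also hold local ports
        # Current fd usage of the query/indexer processes:
        for p in $(pgrep -x cbq-engine; pgrep -x indexer); do
            echo "$p $(ls /proc/$p/fd | wc -l)"
        done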
        mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
        Assignee Mahesh Mandhare [ mahesh.mandhare ] Deepkaran Salooja [ deepkaran.salooja ]
        Resolution Won't Fix [ 2 ]
        Status Resolved [ 5 ] Reopened [ 4 ]

        Based on the lsof output in couchbase.log for 172.23.97.19, here are the major consumers of file handles (couchbase processes only):

        1140 cbq-engine
        440 indexer
        325 memcached
        225 beam.smp

        I don't know what workload is being run on the system. All of these may be legitimate usage.

        Based on our documentation, our installer tries to set the file handle limit to 40960. So I think you can go higher.
        https://docs.couchbase.com/server/5.5/troubleshooting/troubleshooting-common-errors.html

        deepkaran.salooja Deepkaran Salooja added a comment -
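        As a side note, a sketch of how to confirm what limit the running Couchbase processes actually inherited, since an installer-set value only applies to processes started after it takes effect; the process names follow the list above, everything else is illustrative:

        # Sketch: report the effective open-files limit of each running Couchbase process.
        for p in beam.smp memcached indexer cbq-engine; do
            pid=$(pgrep -x -u couchbase "$p" | head -1)
            [ -n "$pid" ] && echo "$p: $(grep 'Max open files' /proc/$pid/limits)"
        done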
        deepkaran.salooja Deepkaran Salooja made changes -
        Assignee Deepkaran Salooja [ deepkaran.salooja ] Mahesh Mandhare [ mahesh.mandhare ]

        Build 6.5.0-2640

        Ran the test with open file limit as 81920 and index + n1ql swap rebalance completed successfully.

        Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/293 

        Deepkaran Salooja Shivani Gupta,

        do we need to create doc issue to capture this behaviour ?

        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -
        mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
        Assignee Mahesh Mandhare [ mahesh.mandhare ] Deepkaran Salooja [ deepkaran.salooja ]
        shivani.gupta Shivani Gupta added a comment - - edited

        Mahesh Mandhare my question is why didn't the installer set it to 40960 in this case? Or did you bypass the installer?

        I don't think we have to document it if it is built into the installer.

        Also, btw, the above documentation page seems to have disappeared from 6.0 documentation - that might have to be filed as a bug.

        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

        Build 6.5.0-2640

        Ran the same test again with the open file limit set to 81920, and the index + n1ql swap rebalance failed with the same reason.

        Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/294

        Logs (nodes have the same roles as mentioned in the description)-
        https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.97.12.zip
        https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.97.13.zip
        https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.97.14.zip
        https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.97.15.zip
        https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.96.20.zip
        https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.96.23.zip
        https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.97.177.zip
        https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.97.19.zip
        https://s3.amazonaws.com/bugdb/jira/hbd-logs-1/collectinfo-2019-03-26T141923-ns_1@172.23.97.20.zip
        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

        Build 6.5.0-2802

        Here are logs from the latest run which has the same issue-
        https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.96.20.zip
        https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.96.23.zip
        https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.97.12.zip
        https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.97.13.zip
        https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.97.14.zip
        https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.97.15.zip
        https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.97.177.zip
        https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.97.19.zip
        https://s3.amazonaws.com/bugdb/jira/hbd_2/collectinfo-2019-04-10T103209-ns_1%40172.23.97.20.zip

        The failure is coming from node 172.23.97.20.
        In couchbase.log, I see the following open file counts for couchbase processes:

        memcached - 330
        cbq-engine - 1416
        indexer - 425

        I cannot determine the file handle limit because the error below is reported while collecting this information:

        cbcollect_info.log:[2019-04-10T18:28:14.991981+08:00] couchbase user limits (su couchbase -s /bin/sh -c "ulimit -a") - skipped (needs root privs)
        

        Is this a non-root/non-sudo installation? 

        I think we need to do the following:

        1. Run the command (su couchbase -s /bin/sh -c "ulimit -a") to verify that couchbase user limits are set correctly on each node.
        2. Manually run lsof on the indexer nodes after the rebalance failure to see where file handles are being used; couchbase.log only collects this for couchbase processes. (A collection sketch follows this comment.)

        deepkaran.salooja Deepkaran Salooja added a comment -
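        A minimal sketch of that collection on each index+n1ql node; the output file names are placeholders, not files referenced elsewhere in this ticket:

        # Sketch: capture user limits and open file handles on one node.
        su couchbase -s /bin/sh -c "ulimit -a" > ulimit_$(hostname).txt
        lsof -u couchbase > lsof_couchbase_$(hostname).txt    # per-user view
        lsof > lsof_full_$(hostname).txt                      # full system view, can be large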
        deepkaran.salooja Deepkaran Salooja made changes -
        Assignee Deepkaran Salooja [ deepkaran.salooja ] Mahesh Mandhare [ mahesh.mandhare ]

        Here is the output for the first to-do on node 172.23.97.20:

        [root@ark08-sa11 ~]# su couchbase -s /bin/sh -c "ulimit -a"
        core file size (blocks, -c) 0
        data seg size (kbytes, -d) unlimited
        scheduling priority (-e) 0
        file size (blocks, -f) unlimited
        pending signals (-i) 1288640
        max locked memory (kbytes, -l) 64
        max memory size (kbytes, -m) unlimited
        open files (-n) 81920
        pipe size (512 bytes, -p) 8
        POSIX message queues (bytes, -q) 819200
        real-time priority (-r) 0
        stack size (kbytes, -s) 8192
        cpu time (seconds, -t) unlimited
        max user processes (-u) 4096
        virtual memory (kbytes, -v) unlimited
        file locks (-x) unlimited
        

        Will run lsof command after failure, next time I run the test.

        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -
        mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
        Attachment lsof.txt [ 70182 ]

        Build 6.5.0-3667

        We have seen the swap rebalance failure again on the high bucket density test.

        Attaching the lsof o/p file.

        Here is ulimit -a o/p:

        [root@ark08-sa11 ~]# su couchbase -s /bin/sh -c "ulimit -a"
        core file size (blocks, -c) 0
        data seg size (kbytes, -d) unlimited
        scheduling priority (-e) 0
        file size (blocks, -f) unlimited
        pending signals (-i) 1288640
        max locked memory (kbytes, -l) 64
        max memory size (kbytes, -m) unlimited
        open files (-n) 81920
        pipe size (512 bytes, -p) 8
        POSIX message queues (bytes, -q) 819200
        real-time priority (-r) 0
        stack size (kbytes, -s) 8192
        cpu time (seconds, -t) unlimited
        max user processes (-u) 4096
        virtual memory (kbytes, -v) unlimited
        file locks (-x) unlimited
        
        

        Logs:
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-312/172.23.97.15.zip
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-312/172.23.97.19.zip
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-312/172.23.97.20.zip

        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment - - edited
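        As an aside, a quick way to summarize an lsof capture like the attached one by owning command; the awk field positions assume the standard lsof column layout shown later in this ticket, so treat this as an illustrative sketch:

        # Sketch: top consumers of file handles by command name in an lsof dump.
        awk 'NR>1 {print $1}' lsof.txt | sort | uniq -c | sort -rn | head
        # Established TCP entries per command:
        grep ESTABLISHED lsof.txt | awk '{print $1}' | sort | uniq -c | sort -rn | head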
        mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
        Assignee Mahesh Mandhare [ mahesh.mandhare ] Deepkaran Salooja [ deepkaran.salooja ]

        From the attached lsof.txt, cbq-engine has 67731 open connections to memcached.

        mulb > grep cbq-engin lsof.txt > cbq.txt
        mulb > grep 11210 cbq.txt | grep ESTABLISHED |wc -l
           67731
        mulb > grep 11210 cbq.txt | grep ESTABLISHED |head
        cbq-engin  29713        couchbase   25u     IPv4           16831690       0t0        TCP k8s-master-1:44072->k8s-master-1:11210 (ESTABLISHED)
        cbq-engin  29713        couchbase   26u     IPv4           16834667       0t0        TCP k8s-master-1:44074->k8s-master-1:11210 (ESTABLISHED)
        cbq-engin  29713        couchbase   27u     IPv4           16834668       0t0        TCP k8s-master-1:44076->k8s-master-1:11210 (ESTABLISHED)
        cbq-engin  29713        couchbase   28u     IPv4           16809758       0t0        TCP k8s-master-1:44078->k8s-master-1:11210 (ESTABLISHED)
        cbq-engin  29713        couchbase   29u     IPv4           16834669       0t0        TCP k8s-master-1:44080->k8s-master-1:11210 (ESTABLISHED)
        cbq-engin  29713        couchbase   39u     IPv4           22578368       0t0        TCP k8s-master-1:50386->k8s-worker-2:11210 (ESTABLISHED)
        cbq-engin  29713        couchbase   40u     IPv4           26647300       0t0        TCP k8s-master-1:35222->k8s-worker-3:11210 (ESTABLISHED)
        cbq-engin  29713        couchbase   43u     IPv4           22529880       0t0        TCP k8s-master-1:50384->k8s-worker-2:11210 (ESTABLISHED)
        cbq-engin  29713        couchbase   45u     IPv4           22565502       0t0        TCP k8s-master-1:50385->k8s-worker-2:11210 (ESTABLISHED)
        cbq-engin  29713        couchbase   52u     IPv4           22580104       0t0        TCP k8s-master-1:50388->k8s-worker-2:11210 (ESTABLISHED)
        

        The node has an 81920 open file handle limit.

        [root@ark08-sa11 ~]# su couchbase -s /bin/sh -c "ulimit -a"
        open files (-n) 81920
        

        The open file limit got exhausted because cbq-engine was using 67731 of those, and this led to the rebalance failure:

         {log_entry,{1562,234031,246086},
                    'ns_1@172.23.96.20',ns_orchestrator,0,
                    [<<"Rebalance exited with reason {service_rebalance_failed,index,\n                              {rebalance_failed,\n
                         {service_error,\n                                <<\"Post http://127.0.0.1:9102/dropIndex: dial tcp 127.0.0.1:9102: connect: cannot assign requested address\">>}}}.\nRebalance Operation Id = d7d3fa5796a8e533c227863e12681b5f">>],
        

        Query team, please check why cbq-engine has so many open connections to memcached.

        deepkaran.salooja Deepkaran Salooja added a comment -
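        For completeness, a sketch of how those 11210 connections could be broken down per remote endpoint from the same lsof capture; the field positions assume the lsof output format shown above:

        # Sketch: established cbq-engine connections to port 11210, grouped by remote peer.
        grep cbq-engin lsof.txt | grep ESTABLISHED | grep ':11210' \
          | awk '{print $(NF-1)}' | awk -F'->' '{print $2}' | sort | uniq -c | sort -rn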
        deepkaran.salooja Deepkaran Salooja made changes -
        Assignee Deepkaran Salooja [ deepkaran.salooja ] Keshav Murthy [ keshav ]
        deepkaran.salooja Deepkaran Salooja made changes -
        Component/s query [ 10718 ]
        Component/s secondary-index [ 11211 ]
        keshav Keshav Murthy made changes -
        Assignee Keshav Murthy [ keshav ] Isha Kandaswamy [ isha ]
        keshav Keshav Murthy made changes -
        Assignee Isha Kandaswamy [ isha ] Sitaram Vemulapalli [ sitaram.vemulapalli ]

        MB-33170 should have taken care of it. It has been over 2 months; can you rerun the test?

        Collect how many cbq-engine connections are open before the rebalance and during/after the rebalance.
        If the problem occurs, collect a goroutine dump of cbq-engine and provide the cluster details; we will take a look.

        Sitaram.Vemulapalli Sitaram Vemulapalli added a comment -
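        A minimal sketch of that collection, assuming it runs on each index+n1ql node for the duration of the rebalance; the 8093 pprof endpoint is the one given later in this ticket, while the credentials and the 30s interval are placeholders:

        # Sketch: sample cbq-engine connection counts and grab goroutine dumps during rebalance.
        while true; do
            ts=$(date +%Y%m%dT%H%M%S)
            echo "$ts $(ss -tnp 2>/dev/null | grep -c cbq-engine)" >> cbq_conn_counts.txt
            curl -s -u Administrator:password \
                 "http://localhost:8093/debug/pprof/goroutine?debug=2" > "cbq_goroutines_$ts.txt"
            sleep 30
        done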
        Sitaram.Vemulapalli Sitaram Vemulapalli made changes -
        Assignee Sitaram Vemulapalli [ sitaram.vemulapalli ] Mahesh Mandhare [ mahesh.mandhare ]
        mahesh.mandhare Mahesh Mandhare (Inactive) made changes -

        Build 6.5.0-4059

        Observed that the swap rebalance of the index+n1ql node failed with error-

        Rebalance exited with reason {service_rebalance_failed,index,
        {rebalance_failed,
        {service_error,
        <<"Post http://127.0.0.1:9102/buildIndex: dial tcp 127.0.0.1:9102: connect: cannot assign requested address">>}}}.
        Rebalance Operation Id = c92552ca70db6277c2402bb98db8a912

        Attaching lsof command output for one of the index+n1ql nodes, and the cbq-engine go routine dumps - lsof_6.5.0-4059_172-23-97-20.txt, cbq-engine_go_dump_172-23-97-15.txt, cbq-engine_go_dump_172-23-97-19.txt, cbq-engine_go_dump_172-23-97-20.txt

        Logs-

        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-318/172.23.97.15.zip

        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-318/172.23.97.19.zip

        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-318/172.23.97.20.zip

        Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/318/

        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -
        mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
        Assignee Mahesh Mandhare [ mahesh.mandhare ] Sitaram Vemulapalli [ sitaram.vemulapalli ]
        Sitaram.Vemulapalli Sitaram Vemulapalli added a comment - - edited

        Could you please tell me what queries you run and what steps you perform,
        and how to set up the 30 buckets and all the queries?
        I need the cluster for a long time so I would be able to run the steps and repro. I may have to do instrumentation.


        Here are details-

        Queries:
        We have partitioned indexes; each bucket has 2 indexes. The index definitions look like-

        CREATE INDEX idx11 ON `bucket-1`(alt_email, name) PARTITION BY HASH(META().id) WITH {"defer_build": true};
        CREATE INDEX idx12 ON `bucket-1`(alt_email, city) PARTITION BY HASH(META().id) WITH {"defer_build": true};
        

        Queries run on all buckets, and they look like-

        SELECT email, alt_email FROM `bucket-1` WHERE alt_email > 'a0' limit 1;
        

        Test file to get definition of all indexes and queries - https://github.com/couchbase/perfrunner/blob/master/tests/multi_bucket/KS_P0_30bucket.test

        Here is the REST API call to increase the max buckets on a CB cluster - https://github.com/couchbase/perfrunner/blob/master/perfrunner/helpers/rest.py#L245

        As of now we merge 2 clusters to create a big cluster to run these tests, hence we do not have the setup/repro available for a long time.
         

        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment - - edited

        Looked at the issue extensively. Nothing unusual found; no connections are leaking. As far as file descriptors are concerned, N1QL does not use any files; all of the TCP connections use file descriptors:

        • Connections to KV
        • Incoming REST HTTP requests
        • HTTP connections to ns server for meta data
        • Remote query service HTTP connections
        • GSI Client to indexer 
        • GSI client to ns server
        • FTS client to ns server
        • FTS client to FTS indexer
        • Even more

         

        Underneath, HTTP does connection pooling for performance, and as the number of buckets increases, more file descriptors are needed. Please increase the open files limit or set it to unlimited. (A back-of-envelope sketch follows this comment.)

         

        Sitaram.Vemulapalli Sitaram Vemulapalli added a comment -
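        To illustrate the scaling argument, a back-of-envelope sketch; the node count, bucket count, and per-pool connection count are placeholders, not measured values for cbq-engine:

        # Sketch: rough fd estimate if each (KV node, bucket) pair keeps its own connection pool.
        KV_NODES=7        # assumption for this cluster
        BUCKETS=30
        POOL_SIZE=8       # hypothetical connections per pool
        echo $(( KV_NODES * BUCKETS * POOL_SIZE ))   # data connections alone, before GSI/FTS/REST traffic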
        Sitaram.Vemulapalli Sitaram Vemulapalli made changes -
        Assignee Sitaram Vemulapalli [ sitaram.vemulapalli ] Mahesh Mandhare [ mahesh.mandhare ]

        Mahesh Mandhare Please let us know how the test went after increasing the max number of file descriptors.

        keshav Keshav Murthy added a comment -
        keshav Keshav Murthy made changes -
        Resolution Done [ 6 ]
        Status Reopened [ 4 ] Resolved [ 5 ]

        Bulk closing all invalid, duplicate and won't fix bugs. Please feel free to reopen them

        raju Raju Suravarjjala added a comment -
        raju Raju Suravarjjala made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

        Build 6.5.0-4744

        Observed a similar rebalance failure even after increasing the open files limit to 1000000.

        Here is ulimit -a o/p on one of the index+n1ql nodes:

        [root@arke07-sa10 ~]# su couchbase -s /bin/sh -c "ulimit -a"
        core file size          (blocks, -c) 0
        data seg size           (kbytes, -d) unlimited
        scheduling priority             (-e) 0
        file size               (blocks, -f) unlimited
        pending signals                 (-i) 1288637
        max locked memory       (kbytes, -l) 64
        max memory size         (kbytes, -m) unlimited
        open files                      (-n) 1000000
        pipe size            (512 bytes, -p) 8
        POSIX message queues     (bytes, -q) 819200
        real-time priority              (-r) 0
        stack size              (kbytes, -s) 8192
        cpu time               (seconds, -t) unlimited
        max user processes              (-u) 4096
        virtual memory          (kbytes, -v) unlimited
        file locks                      (-x) unlimited

         

        Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/330/

        Logs-

        Index+query nodes-

        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.97.15.zip 
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.97.19.zip 
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.97.20.zip 

        Other nodes-

        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.96.16.zip 
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.96.17.zip 
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.96.20.zip 
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.96.23.zip 
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.97.12.zip 
        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.97.13.zip 

        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.97.14.zip

        https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-330/172.23.97.177.zip 

         

        Note that in the last run, on 6.5.0-4558, this issue did not reproduce - http://perf.jenkins.couchbase.com/job/arke-multi-bucket/325/

        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment - - edited
        mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
        Resolution Done [ 6 ]
        Status Closed [ 6 ] Reopened [ 4 ]
        mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
        Assignee Mahesh Mandhare [ mahesh.mandhare ] Sitaram Vemulapalli [ sitaram.vemulapalli ]
        Sitaram.Vemulapalli Sitaram Vemulapalli added a comment - - edited

        Not much has changed from 6.5.0-4558. As mentioned earlier, the query service never uses any files except connections. Nothing can be done at present about the connections.

        You should not reopen old defects repeatedly if they pass and then re-appear.

        If you can consistently reproduce it, narrow down the build that caused the issue. First check with 6.5.0-4578.

        Sitaram.Vemulapalli Sitaram Vemulapalli made changes -
        Assignee Sitaram Vemulapalli [ sitaram.vemulapalli ] Kamini Jagtiani [ kamini.jagtiani ]
        kamini.jagtiani Kamini Jagtiani added a comment - - edited

        Mahesh- Can you please help narrow down the build where you see it happen?

        Sitaram.Vemulapalli Sitaram Vemulapalli made changes -
        Assignee Kamini Jagtiani [ kamini.jagtiani ] Mahesh Mandhare [ mahesh.mandhare ]

        Kamini Jagtiani [5:58 AM]
        Hi Mahesh - any update on this bug(MB-32545)?

        Mahesh Mandhare [6:24 AM]
        Tried running the test; we are hitting some other issue in the setup itself on high bucket density with the latest build.

        kamini.jagtiani Kamini Jagtiani added a comment -
        lynn.straus Lynn Straus made changes -
        Fix Version/s Cheshire-Cat [ 15915 ]
        Fix Version/s Mad-Hatter [ 15037 ]
        lynn.straus Lynn Straus made changes -
        Labels high-bucket-density deferred-from-Mad-Hatter high-bucket-density
        mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
        Attachment info.zip [ 77960 ]

        When I tried to run this test on the latest build, 6.5.0-4841, I started hitting MB-36973

        So I tried a rerun on 6.5.0-4744; the first swap rebalance was successful and the second swap rebalance failed - http://perf.jenkins.couchbase.com/job/arke-multi-bucket/337/

        I collected lsof, goroutine info and netstat o/p during the first rebalance (info.zip) if it is of any use; next time I will try to collect this information during the second rebalance.

        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

        Will take a look later. Did you remove all the unwanted steps and reduce the documents in the bucket to 1000 as described in the mail?

        Did you try with 6.5.0-4578?

        Sitaram.Vemulapalli Sitaram Vemulapalli added a comment -

        Sitaram Vemulapalli, I started this test yesterday so it has all the steps.

        Will try 6.5.0-4578 when I get the clusters again.

        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -
        Sitaram.Vemulapalli Sitaram Vemulapalli added a comment - - edited

        I have looked at 172.23.97.20, which has the highest number of open files. From the info I cannot find anything unusual.

        wc -l lsof.txt
        256690 lsof.txt

        cbq-engine used 50%
        grep cbq-engin lsof.txt | wc -l
        133792

        25% are established connections to memcached (these are not under our control; underneath, Go uses them)

        grep cbq-engin lsof.txt | grep "11210" | wc -l
        74793

        6% used by port 8091 (lsof reports it by its service name "jamlink", https://www.speedguide.net/port.php?port=8091)
        grep cbq-engin lsof.txt | grep "jamlink" | wc -l
        15171

        6% used by bacula-dir (I think firewalls)

        grep cbq-engin lsof.txt | grep "bacula-dir" | wc -l
        14659

        According to netstat only 794 connections are established.

        grep 11210 neststat.txt | wc -l
        794

        The goroutine dump does not seem right; I am not sure how it was collected. A lot of mandatory goroutines are missing.
        Use the following command.

        curl -u Administrator:password http://localhost:8093/debug/pprof/goroutine?debug=2

        These must be collected when the failure happens, not later, because that gives the correct picture to correlate. The goroutine dump will show what goroutines are running.

        As mentioned, the whole Couchbase architecture is built on REST/HTTP, which uses connection pooling to make it faster, so it may need more files. Also, instead of limiting files, set the limit to unlimited and try.

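        One possible way to reconcile the lsof and netstat numbers is to count unique sockets rather than raw lsof lines, since lsof can list the same socket once per thread. This sketch assumes the socket inode sits in the sixth column, as in the lsof excerpts above:

        # Sketch: unique cbq-engine sockets to 11210 by inode, compared with the live socket count.
        grep cbq-engin lsof.txt | grep ':11210' | awk '{print $6}' | sort -u | wc -l
        ss -tn | grep -c ':11210'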

        Build 6.5.0-4959

        Not able to find a way to set the open files limit to unlimited; found that it can be set to a maximum of 1048576 (https://stackoverflow.com/a/1213069/2320823). Let me know if there is any way to set it to unlimited.
        Tried setting the open files limit to 1048575 on the index+query nodes and ran the test; the swap rebalance failed with a similar error.
        Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/343
        Not able to collect the goroutine dump while the swap rebalance was happening; will try next time.
        The previously collected goroutine dump was taken with http://localhost:8093/debug/pprof/goroutine?debug=1

        mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -
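        For what it's worth, the 1048576 ceiling usually comes from the kernel's fs.nr_open default, and the per-process nofile limit cannot exceed it. A sketch of raising it follows; the 2097152 value is illustrative, and whether going this high is advisable for these nodes is an assumption:

        # Sketch: inspect and raise the kernel ceilings that cap the nofile limit.
        sysctl fs.nr_open fs.file-max        # current per-process and system-wide ceilings
        sysctl -w fs.nr_open=2097152         # raise the per-process ceiling (illustrative value)
        # then raise the couchbase user's nofile limit up to the new ceiling and restart the services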
        kamini.jagtiani Kamini Jagtiani made changes -
        Assignee Mahesh Mandhare [ mahesh.mandhare ] Marco Greco [ marco.greco ]
        marco.greco Marco Greco added a comment -

        Mahesh Mandhare this is not something that can be fixed easily - it requires rearchitecting the layer that connects n1ql and memcached to switch from using a connection pool for each node+bucket combination to having connection pools per node only.
        To make it more complicated, this layer is used by other components as well.
        Also, the GSI client would have to be rewritten in a similar fashion, and so would FTS.

        Currently the story is - the more buckets you have, the more file descriptors you are going to use.
        I'm sure we will at some stage revisit this - for collections and TCO - but the amount of work required is substantial, so it's not something that we would be able to do for Mad-Hatter.
        Even Cheshire-Cat is not a given.

        marco.greco Marco Greco made changes -
        Assignee Marco Greco [ marco.greco ] Mahesh Mandhare [ mahesh.mandhare ]
        raju Raju Suravarjjala made changes -
        Assignee Mahesh Mandhare [ mahesh.mandhare ] Sharath Sulochana [ sharath.sulochana ]
        raju Raju Suravarjjala made changes -
        Assignee Sharath Sulochana [ sharath.sulochana ] Korrigan Clark [ korrigan.clark ]
        wayne Wayne Siu made changes -
        Assignee Korrigan Clark [ korrigan.clark ] Prajwal‌ Kiran Kumar‌ [ prajwal.kirankumar ]
        kamini.jagtiani Kamini Jagtiani made changes -
        Fix Version/s bug-backlog [ 11600 ]
        Fix Version/s Cheshire-Cat [ 15915 ]
        wayne Wayne Siu made changes -
        Assignee Prajwal‌ Kiran Kumar‌ [ prajwal.kirankumar ] Jyotsna Nayak [ jyotsna.nayak ]

        People

          Assignee:
          jyotsna.nayak Jyotsna Nayak
          Reporter:
          mahesh.mandhare Mahesh Mandhare (Inactive)
          Votes:
          0
          Watchers:
          11

          Dates

            Created:
            Updated:

            Gerrit Reviews

              There are no open Gerrit changes
