Couchbase Server / MB-32782

[high-bucket] - rebalance is very slow after failover



    Description

      Build 6.5.0-2082

      Observed that in the high-bucket-density test (with 30 buckets), rebalance after a hard failover of a KV node is very slow.
      In the test, rebalance is only 78% complete on the KV nodes after ~14 hours. Please investigate whether this is expected.

      Note: On the same cluster, removing a KV node without failover (4 KV nodes -> 3) and rebalancing takes ~206 min.
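      For scale, a rough sketch of what the numbers above imply (it assumes rebalance progress is roughly linear, which the ticket does not state and rebalance does not guarantee):

      # Rough comparison of the two runs described above; the durations come from
      # this ticket, the linear-progress assumption does not.
      minutes_elapsed = 14 * 60      # ~14 hours, at which point KV rebalance was ~78% done
      fraction_done = 0.78
      baseline_minutes = 206         # 4 -> 3 rebalance-out on the same cluster, no failover

      implied_total = minutes_elapsed / fraction_done
      print(f"implied rebalance-after-failover time: ~{implied_total:.0f} min")
      print(f"ratio vs. 4 -> 3 rebalance-out ({baseline_minutes} min): "
            f"~{implied_total / baseline_minutes:.1f}x slower")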

      Test:
      Out of 3 KV nodes, 1 is hard failed over and then rebalance is started without adding the node back.
      Buckets and docs: 32 buckets, ~1M docs of 1 KB each per bucket.
      Number of replicas: 1
      XDCR: on
      KV ops: ~200 for the entire cluster
      The cluster also had index, query, FTS, eventing, and analytics nodes.

      Logs-
      https://s3.amazonaws.com/bugdb/jira/mh_high_bkt_density_failover/collectinfo-2019-01-23T055648-ns_1%40172.23.97.12.zip
      https://s3.amazonaws.com/bugdb/jira/mh_high_bkt_density_failover/collectinfo-2019-01-23T055648-ns_1%40172.23.97.13.zip
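      As a reproduction aid, here is a minimal sketch of standing up the bucket topology described in the test via the bucket REST endpoint (POST /pools/default/buckets). The host, credentials, bucket names, and the 256 MB quota are placeholders rather than values from this ticket, and the parameter names should be double-checked against the 6.5 REST API documentation.

      import requests

      HOST = "http://172.23.97.12:8091"      # placeholder: any cluster node's admin port
      AUTH = ("Administrator", "password")   # placeholder credentials

      for i in range(32):                    # 32 buckets, as in the test
          resp = requests.post(
              f"{HOST}/pools/default/buckets",
              auth=AUTH,
              data={
                  "name": f"bucket-{i}",
                  "bucketType": "couchbase",
                  "ramQuotaMB": 256,         # assumed per-bucket quota; not stated in the ticket
                  "replicaNumber": 1,        # matches the test: 1 replica
              },
          )
          resp.raise_for_status()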


        Activity

          Raju Suravarjjala added a comment - Bulk closing all invalid, duplicate and won't fix bugs. Please feel free to reopen them.

           
          Poonam Dhavale added a comment -

          The rebalance logs have rotated out. Only messages from the last rebalance are available.

          As I mentioned earlier, a rebalance-out after failover (with 3 KV nodes) will take longer than a 4 -> 3 rebalance because more vBuckets are moved during the former. The vBucket scheduling logic also plays a role, as explained in MB-32642. These are most likely the root cause, but I was hoping to compare the average time taken to move a vBucket during the two types of rebalance.

          Mahesh, here are a couple of options:

          Option #1: Rerun the 4 -> 3 rebalance and the rebalance-after-failover tests with 30 buckets. Collect logs after each type of rebalance.

          Option #2: Run the 4 -> 3 rebalance and the rebalance-after-failover tests with fewer buckets (<= 10), and compare the rebalance time increase with the one seen during the 30-bucket tests. E.g., say rebalance after failover takes X% longer than the 4 -> 3 rebalance with 10 buckets. Is the increase around the same (~X%) with the 30-bucket test? If yes, then we can close this ticket.
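          To make the "more vBuckets are moved" point concrete, here is a rough counting sketch. It assumes 1024 vBuckets per bucket, one replica, and an even distribution, and it counts only vBuckets that need a full backfill; it is not ns_server's planner and it ignores the scheduling behaviour tracked in MB-32642.

          NUM_VBUCKETS = 1024     # per bucket
          NUM_BUCKETS = 32
          REPLICAS = 1

          # 4 -> 3 rebalance-out: the removed node's active and replica vBuckets
          # (~256 of each) must be rebuilt on the remaining three nodes.
          out_moves = (NUM_VBUCKETS // 4) * (1 + REPLICAS)        # ~512 per bucket

          # Hard failover of 1 of 3 KV nodes, then rebalance on the two survivors:
          # replicas are promoted in place, but every vBucket that had a copy on
          # the failed node (~341 actives plus ~341 replicas) is left without a
          # replica and must be backfilled.
          failover_moves = (NUM_VBUCKETS // 3) * (1 + REPLICAS)   # ~682 per bucket

          print(f"4 -> 3 rebalance        : ~{out_moves} moves/bucket, ~{out_moves * NUM_BUCKETS} in total")
          print(f"rebalance after failover: ~{failover_moves} moves/bucket, ~{failover_moves * NUM_BUCKETS} in total")
          print(f"increase from move count alone: ~{100 * (failover_moves - out_moves) / out_moves:.0f}%")

          On this rough model the extra moves account for only a ~33% increase, which is consistent with the comment above also pointing at the vBucket scheduling logic (MB-32642) rather than at move count alone.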
          Mahesh Mandhare (Inactive) added a comment - I had a local copy of the logs; uploaded them again at:
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.96.20.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.14.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.19.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.96.23.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.15.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.20.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.12.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.177.zip

          Poonam Dhavale added a comment - Hi Mahesh Mandhare, the logs are not accessible. Please keep them around for longer.

          Mahesh Mandhare (Inactive) added a comment - Build 6.5.0-2802. In a recent run, rebalance after failover took ~425 min to complete. Logs:
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.96.20.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.96.23.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.12.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.14.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.15.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.177.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.19.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.20.zip
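          Applying the X% framing from Option #2 to the figures reported so far in this ticket (note this mixes builds: the ~206 min 4 -> 3 figure is from the original description's run on 6.5.0-2082, and the <= 10-bucket baseline is still missing):

          # Figures reported in this ticket for the ~30-bucket runs.
          rebalance_out_min = 206      # 4 -> 3 rebalance without failover (description)
          after_failover_min = 425     # rebalance after hard failover (this comment, 6.5.0-2802)

          x_pct = 100.0 * (after_failover_min - rebalance_out_min) / rebalance_out_min
          print(f"rebalance after failover is ~{x_pct:.0f}% slower than the 4 -> 3 rebalance-out")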

          People

            Assignee: Mahesh Mandhare (Inactive)
            Reporter: Mahesh Mandhare (Inactive)
            Votes: 0
            Watchers: 8

