Couchbase Server / MB-32782

[high-bucket] - rebalance is very slow after failover


Details

    • Untriaged
    • Unknown

    Description

      Build 6.5.0-2082

      Observed that in the high-bucket-density test (with 30 buckets), rebalance after a hard failover of a KV node is very slow.
      In the test, it is 78% complete on the KV nodes after ~14 hours. Please investigate whether this is expected.

      Note: Removing a KV node (going from 4 nodes to 3) from the same cluster without failover, the rebalance takes ~206 min.

      Test:
      Out of 3 KV nodes, 1 is hard failed over and then rebalance is started without adding the node back.
      Buckets and docs: 32 buckets, ~1M docs of 1 KB each per bucket.
      Number of replicas: 1
      XDCR: on
      KV ops: ~200 for the entire cluster
      The cluster also had index, query, fts, eventing, and analytics nodes.

      Logs-
      https://s3.amazonaws.com/bugdb/jira/mh_high_bkt_density_failover/collectinfo-2019-01-23T055648-ns_1%40172.23.97.12.zip
      https://s3.amazonaws.com/bugdb/jira/mh_high_bkt_density_failover/collectinfo-2019-01-23T055648-ns_1%40172.23.97.13.zip

      Attachments


        Activity

          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment - Build 6.5.0-2802 In a recent run, rebalance after failover took ~425 min to complete. Logs-
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.96.20.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.96.23.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.12.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.14.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.15.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.177.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.19.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.20.zip

           


          poonam Poonam Dhavale added a comment -   Hi Mahesh Mandhare , the logs are not accessible. Please keep them around for longer.
          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment - I had a local copy of the logs; uploaded them again at-
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.96.20.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.14.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.19.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.96.23.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.15.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.20.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.12.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.177.zip

           
          poonam Poonam Dhavale added a comment -

          The rebalance logs have rotated out. Only messages from the last rebalance are available.

          As I mentioned earlier, rebalance-out after failover (with 3 KV nodes) will take longer than a 4 -> 3 rebalance because a larger number of vBuckets is moved during the former. The vBucket scheduling logic also plays a role, as explained in MB-32642. These are most likely the root cause, but I was hoping to compare the average time taken to move a vBucket during the two types of rebalance.

          Mahesh, here are a couple of options:

          Option #1: Rerun the 4 -> 3 rebalance and the rebalance-out-after-failover test with 30 buckets. Collect logs after each type of rebalance.

          Option #2: Run the 4 -> 3 rebalance and the rebalance-out-after-failover test with fewer buckets (<= 10). Compare the rebalance time increase with the one seen during the 30-bucket tests. E.g., say rebalance-out after failover takes X% longer than the 4 -> 3 rebalance with 10 buckets. Is the increase around the same (~X%) with the 30-bucket test? If yes, then we can close this ticket.
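          The comparison proposed in Option #2 can be sketched as a quick calculation. This is illustrative only; all minute values below are hypothetical except the ~206 min 4 -> 3 figure from the description.

```python
def pct_increase(baseline_min, failover_min):
    """Percent by which rebalance-out after failover exceeds the 4 -> 3 rebalance."""
    return (failover_min - baseline_min) / baseline_min * 100

# Hypothetical 10-bucket run: 60 min baseline vs 90 min after failover.
x_10 = pct_increase(60, 90)

# 30-bucket run: ~206 min baseline (from the description) vs a hypothetical 310 min.
x_30 = pct_increase(206, 310)

# If the two percentages are close, the slowdown scales with bucket count as
# expected, and per Option #2 the ticket can be closed.
print(round(x_10, 1), round(x_30, 1))  # -> 50.0 50.5
```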


          raju Raju Suravarjjala added a comment - Bulk closing all invalid, duplicate and won't fix bugs. Please feel free to reopen them

          People

            mahesh.mandhare Mahesh Mandhare (Inactive)
            mahesh.mandhare Mahesh Mandhare (Inactive)
            Votes: 0
            Watchers: 8


              Gerrit Reviews

                There are no open Gerrit changes
