MB-32782 (Couchbase Server): [high-bucket] - rebalance is very slow after failover


Details


    Description

      Build 6.5.0-2082

      Observed that in the high-bucket-density test (with 30 buckets), rebalance after a hard failover of a KV node is very slow.
      In the test, the rebalance was only 78% complete on the KV nodes after ~14 hours. Please investigate whether this is expected.

      Note: Removing a KV node from the same cluster without a failover (going from 4 nodes to 3) takes ~206 min to rebalance.

      Test:
      Out of 3 KV nodes, 1 is hard failed over and rebalance is then started without adding the node back.
      Buckets and docs: 32 buckets, ~1M docs of 1 KB each per bucket.
      Number of replicas: 1
      XDCR: on
      KV ops: ~200 for the entire cluster
      The cluster also had index, query, FTS, eventing, and analytics nodes.

      Logs:
      https://s3.amazonaws.com/bugdb/jira/mh_high_bkt_density_failover/collectinfo-2019-01-23T055648-ns_1%40172.23.97.12.zip
      https://s3.amazonaws.com/bugdb/jira/mh_high_bkt_density_failover/collectinfo-2019-01-23T055648-ns_1%40172.23.97.13.zip


        Activity

          Ajit Yagaty added a comment - Poonam Dhavale, can you please take a look at this?

          Poonam Dhavale added a comment - As I mentioned in MB-32642, different types of rebalance can take different amounts of time because they may be moving different numbers of vBuckets.

          In a 4 -> 3 rebalance-out, 512 vBuckets will be moved in each bucket; in a 3 -> 2 rebalance-out (which is the case here), 680 vBuckets will be moved in each bucket. So it is expected that the 3 -> 2 rebalance will take longer.
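
          To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python, assuming the default 1024 vBuckets per bucket and 1 replica (as in this test). It approximates the move count only; it is not ns_server's actual planner logic.

            # Rough estimate of vBucket moves in a rebalance-out, assuming the
            # default 1024 vBuckets per bucket and 1 replica. Illustrative only.
            VBUCKETS = 1024
            COPIES = 2  # one active + one replica copy of each vBucket

            def moves_per_bucket(nodes_before_removal: int) -> int:
                # In a balanced cluster the departing node hosts ~1/N of all
                # copies; each of them must be rebuilt on the remaining nodes.
                return VBUCKETS * COPIES // nodes_before_removal

            print(moves_per_bucket(4))  # 512 for 4 -> 3
            print(moves_per_bucket(3))  # 682 for 3 -> 2, close to the 680 cited
                                        # above; the exact count depends on the
                                        # vBucket map.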

          I checked the logs. There are 32 buckets in total (30 + 2 eventing buckets). Each bucket's rebalance is taking ~30-40 minutes.

          The rebalance had been running for around 12.5 hours when cbcollect was issued, and had rebalanced 22 buckets.
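
          A quick arithmetic check of that rate (an illustrative sketch, not taken from the logs themselves):

            # Consistency check of the observed per-bucket rebalance rate.
            hours_elapsed = 12.5   # runtime when cbcollect was issued
            buckets_done = 22
            total_buckets = 32

            mins_per_bucket = hours_elapsed * 60 / buckets_done     # ~34 min, within 30-40 min
            projected_hours = total_buckets * mins_per_bucket / 60  # ~18 h for all 32 buckets
            print(f"~{mins_per_bucket:.0f} min/bucket, ~{projected_hours:.1f} h projected total")
            # ~18 h total also matches the description: 14 h / 18.2 h = ~77% complete.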

          • We need logs from the orchestrator node (172.23.96.20) for the 3 -> 2 rebalance.
          • We also need logs from the 4 -> 3 rebalance.
          • Then we can compare the two to see where the time is spent in each, how long the vBucket moves are taking, and so on. This will help determine whether the 3 -> 2 rebalance takes longer primarily because it moves 168 (= 680 - 512) more vBuckets per bucket, or because of something else (see the sketch after this list).
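
          A hypothetical helper for that comparison is sketched below. The regular expression is a placeholder and must be adapted to the actual ns_server log format, which this sketch does not assume; the file names are hypothetical as well.

            import re
            import statistics

            # Placeholder pattern; adapt to the real ns_server rebalance log lines.
            MOVE_RE = re.compile(r"move of vbucket (\d+).*took ([\d.]+)\s*s")

            def move_durations(log_path):
                # Per-vBucket move durations (seconds) from one cbcollect log set.
                with open(log_path, errors="ignore") as f:
                    return [float(m.group(2)) for line in f if (m := MOVE_RE.search(line))]

            for label, path in [("4 -> 3", "rebalance_4_to_3.log"),
                                ("3 -> 2 after failover", "rebalance_after_failover.log")]:
                durations = move_durations(path)
                if durations:
                    print(f"{label}: {len(durations)} moves, "
                          f"avg {statistics.mean(durations):.1f}s")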



          Mahesh Mandhare added a comment - Poonam Dhavale, we will collect the required logs the next time we run this case.

          Shivani Gupta added a comment - Mahesh Mandhare, was this faster when you ran the same test on 5.5.2?

          Mahesh Mandhare added a comment - Shivani Gupta, we didn't run a hard failover on the 5.5.2 build.
          Mahesh Mandhare added a comment - Build 6.5.0-2640

          In a recent run, rebalance after failover took ~380 min to complete.

          Logs:
          https://s3.amazonaws.com/bugdb/jira/hbd-logs-2/collectinfo-2019-03-27T034329-ns_1@172.23.97.12.zip
          https://s3.amazonaws.com/bugdb/jira/hbd-logs-2/collectinfo-2019-03-27T034329-ns_1@172.23.97.13.zip
          https://s3.amazonaws.com/bugdb/jira/hbd-logs-2/collectinfo-2019-03-27T034329-ns_1@172.23.96.20.zip
          https://s3.amazonaws.com/bugdb/jira/hbd-logs-2/collectinfo-2019-03-27T034329-ns_1@172.23.96.23.zip
          https://s3.amazonaws.com/bugdb/jira/hbd-logs-2/collectinfo-2019-03-27T034329-ns_1@172.23.97.177.zip
          https://s3.amazonaws.com/bugdb/jira/hbd-logs-2/collectinfo-2019-03-27T034329-ns_1@172.23.97.19.zip
          https://s3.amazonaws.com/bugdb/jira/hbd-logs-2/collectinfo-2019-03-27T034329-ns_1@172.23.97.20.zip

          Poonam Dhavale added a comment - Hi Mahesh, the links to the logs are not working.
          Mahesh Mandhare added a comment - Build 6.5.0-2802

          In a recent run, rebalance after failover took ~425 min to complete.

          Logs:
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.96.20.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.96.23.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.12.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.14.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.15.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.177.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.19.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.20.zip

          Poonam Dhavale added a comment - Hi Mahesh Mandhare, the logs are not accessible. Please keep them around for longer.
          Mahesh Mandhare added a comment - I had a local copy of the logs; uploaded them again at:

          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.96.20.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.14.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.19.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.96.23.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.15.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.20.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.12.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.177.zip
          Poonam Dhavale added a comment - The rebalance logs have rotated out; only messages from the last rebalance are available.

          As I mentioned earlier, a rebalance-out after failover (with 3 KV nodes) will take longer than a 4 -> 3 rebalance because more vBuckets are moved during the former. The vBucket scheduling logic also plays a role, as explained in MB-32642. These are most likely the root cause, but I was hoping to compare the average time taken to move a vBucket during the two types of rebalance.

          Mahesh, here are a couple of options:

          Option #1: Rerun the 4 -> 3 rebalance and the rebalance-out-after-failover tests with 30 buckets. Collect logs after each type of rebalance.

          Option #2: Run the 4 -> 3 rebalance and the rebalance-out-after-failover tests with fewer buckets (<= 10). Compare the rebalance time increase with the one seen during the 30-bucket tests. E.g., say the rebalance-out after failover takes X% longer than the 4 -> 3 rebalance with 10 buckets. If the increase is about the same (~X%) in the 30-bucket test, then we can close this ticket. A worked version of this comparison is sketched below.
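
          To illustrate the Option #2 comparison: the 10-bucket timings below are hypothetical, while the 30-bucket numbers are the ~206 min and ~380 min reported earlier in this ticket (though from different builds, so they are not strictly comparable).

            # Option #2 comparison, with hypothetical 10-bucket timings.
            def pct_increase(t_4_to_3_min, t_after_failover_min):
                return (t_after_failover_min - t_4_to_3_min) / t_4_to_3_min * 100

            x_10 = pct_increase(60, 85)    # hypothetical 10-bucket runs
            x_30 = pct_increase(206, 380)  # ~206 and ~380 min reported in this ticket
            print(f"10 buckets: +{x_10:.0f}%, 30 buckets: +{x_30:.0f}%")
            # Similar percentages would mean the slowdown is explained by the larger
            # per-bucket move count and scheduling, and the ticket can be closed.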


          Raju Suravarjjala added a comment - Bulk closing all invalid, duplicate, and won't-fix bugs. Please feel free to reopen them.

          People

            Assignee: Mahesh Mandhare
            Reporter: Mahesh Mandhare
