Couchbase Server
MB-32782

[high-bucket] - rebalance is very slow after failover


Details

    • Untriaged
    • Unknown

    Description

      Build 6.5.0-2082

Observed that in the high-bucket-density test (with 30 buckets), rebalance after a hard failover of a KV node is very slow.
In the test, it was 78% complete on the KV nodes after ~14 hours. Please investigate whether this is expected.

Note: Removing a KV node (going from 4 nodes to 3) from the same cluster, without a failover, takes ~206 min to rebalance.

      Test:
Out of 3 KV nodes, 1 was hard-failed-over and rebalance was then started without adding the node back.
Buckets and docs: 32 buckets, ~1M docs of 1 KB each per bucket.
Number of replicas: 1
XDCR: on
KV ops: ~200 for the entire cluster
The cluster also had index, query, FTS, eventing, and analytics nodes.

      Logs-
      https://s3.amazonaws.com/bugdb/jira/mh_high_bkt_density_failover/collectinfo-2019-01-23T055648-ns_1%40172.23.97.12.zip
      https://s3.amazonaws.com/bugdb/jira/mh_high_bkt_density_failover/collectinfo-2019-01-23T055648-ns_1%40172.23.97.13.zip

      Attachments


        Activity

          mahesh.mandhare Mahesh Mandhare (Inactive) created issue -
          ajit.yagaty Ajit Yagaty [X] (Inactive) made changes -
          Field Original Value New Value
          Assignee Ajit Yagaty [ ajit.yagaty ] Poonam Dhavale [ poonam ]

ajit.yagaty Ajit Yagaty [X] (Inactive) added a comment:
Poonam Dhavale - Can you please take a look at this?
          raju Raju Suravarjjala made changes -
          Fix Version/s Mad-Hatter [ 15037 ]

           

poonam Poonam Dhavale added a comment:

As I mentioned in MB-32642, different types of rebalance can take different amounts of time because they may move different numbers of vBuckets.

In a 4 -> 3 rebalance-out, 512 vBuckets are moved in each bucket.

In a 3 -> 2 rebalance-out (which is the case here), 680 vBuckets are moved in each bucket.

So it is expected that the 3 -> 2 rebalance will take longer.

I checked the logs. There are 32 buckets in total (30 + 2 eventing buckets). Each bucket rebalance is taking ~30-40 minutes.

Rebalance had been running for around 12.5 hours when cbcollect was issued, and had rebalanced 22 buckets.

• We need logs from the orchestrator node (172.23.96.20) for the 3 -> 2 rebalance.
• We also need logs from the 4 -> 3 rebalance.
• Then we can compare the two to see where the time is spent in each, how long vBucket moves are taking, and so on. This will help determine whether the 3 -> 2 rebalance is taking longer primarily because it is moving 168 (= 680 - 512) more vBuckets, or because of something else.
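The vBucket counts cited above can be reproduced with a quick back-of-the-envelope calculation. This is a rough sketch, not ns_server's actual vBucket-map planner: it assumes an evenly balanced map of 1024 active vBuckets per bucket (the Couchbase default) plus the one configured replica, so the departing node's share of all copies is roughly what must move.

```python
# Rough sketch of the vBucket-move arithmetic from the comment above.
# Assumption: an evenly balanced vBucket map; NOT ns_server's real planner.
NUM_VBUCKETS = 1024   # active vBuckets per bucket (Couchbase default)
NUM_REPLICAS = 1      # as configured in this test

def vbucket_moves_per_bucket(nodes_before: int) -> int:
    """Approximate number of vBucket copies (active + replica) held by
    the departing node, i.e. the moves needed when one node leaves."""
    total_copies = NUM_VBUCKETS * (1 + NUM_REPLICAS)  # 2048 with 1 replica
    return total_copies // nodes_before

print(vbucket_moves_per_bucket(4))  # 512  (4 -> 3 rebalance-out)
print(vbucket_moves_per_bucket(3))  # 682  (~680 cited for 3 -> 2)
```

The small discrepancy (682 vs. the cited 680) comes from rounding in the real map, where 2048 copies do not divide evenly across 3 nodes.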
          poonam Poonam Dhavale made changes -
          Assignee Poonam Dhavale [ poonam ] Mahesh Mandhare [ mahesh.mandhare ]

mahesh.mandhare Mahesh Mandhare (Inactive) added a comment:
Poonam Dhavale, we will collect the required logs the next time we run this case.

shivani.gupta Shivani Gupta added a comment:
Mahesh Mandhare, was this faster when you ran the same test on 5.5.2?

mahesh.mandhare Mahesh Mandhare (Inactive) added a comment:
Shivani Gupta, we didn't run a hard failover on the 5.5.2 build.
mahesh.mandhare Mahesh Mandhare (Inactive) added a comment:
Build 6.5.0-2640
In a recent run, rebalance after failover took ~380 min to complete.
Logs-
https://s3.amazonaws.com/bugdb/jira/hbd-logs-2/collectinfo-2019-03-27T034329-ns_1@172.23.97.12.zip
https://s3.amazonaws.com/bugdb/jira/hbd-logs-2/collectinfo-2019-03-27T034329-ns_1@172.23.97.13.zip
https://s3.amazonaws.com/bugdb/jira/hbd-logs-2/collectinfo-2019-03-27T034329-ns_1@172.23.96.20.zip
https://s3.amazonaws.com/bugdb/jira/hbd-logs-2/collectinfo-2019-03-27T034329-ns_1@172.23.96.23.zip
https://s3.amazonaws.com/bugdb/jira/hbd-logs-2/collectinfo-2019-03-27T034329-ns_1@172.23.97.177.zip
https://s3.amazonaws.com/bugdb/jira/hbd-logs-2/collectinfo-2019-03-27T034329-ns_1@172.23.97.19.zip
https://s3.amazonaws.com/bugdb/jira/hbd-logs-2/collectinfo-2019-03-27T034329-ns_1@172.23.97.20.zip
          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Poonam Dhavale [ poonam ]
          poonam Poonam Dhavale made changes -
          Assignee Poonam Dhavale [ poonam ] Mahesh Mandhare [ mahesh.mandhare ]

           

poonam Poonam Dhavale added a comment:
Hi Mahesh,
Links to the logs are not working.
mahesh.mandhare Mahesh Mandhare (Inactive) added a comment:
Build 6.5.0-2802
In a recent run, rebalance after failover took ~425 min to complete.
Logs-
https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.96.20.zip
https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.96.23.zip
https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.12.zip
https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.14.zip
https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.15.zip
https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.177.zip
https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.19.zip
https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-11T041252-ns_1%40172.23.97.20.zip
          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Poonam Dhavale [ poonam ]

           

poonam Poonam Dhavale added a comment:
Hi Mahesh Mandhare, the logs are not accessible. Please keep them around for longer.
          poonam Poonam Dhavale made changes -
          Assignee Poonam Dhavale [ poonam ] Mahesh Mandhare [ mahesh.mandhare ]
mahesh.mandhare Mahesh Mandhare (Inactive) added a comment:
I had a local copy of the logs; uploaded them again at-
https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.96.20.zip
https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.14.zip
https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.19.zip
https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.96.23.zip
https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.15.zip
https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.20.zip
https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.12.zip
https://s3.amazonaws.com/bugdb/jira/hbd_4/collectinfo-2019-04-11T041252-ns_1@172.23.97.177.zip
          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Poonam Dhavale [ poonam ]

           
poonam Poonam Dhavale added a comment:

The rebalance logs have rotated out; only messages from the last rebalance are available.

As I mentioned earlier, a rebalance-out after failover (with 3 KV nodes) will take longer than a 4 -> 3 rebalance because more vBuckets are moved during the former. The vBucket scheduling logic also plays a role, as explained in MB-32642. These are most likely the root cause, but I was hoping to compare the average time taken to move a vBucket during the two types of rebalance.

Mahesh, here are a couple of options:

Option #1: Rerun the 4 -> 3 rebalance and the rebalance-out-after-failover tests with 30 buckets. Collect logs after each type of rebalance.

Option #2: Run the 4 -> 3 rebalance and the rebalance-out-after-failover tests with fewer buckets (<= 10). Compare the rebalance time increase with the one seen during the 30-bucket tests. E.g., say rebalance-out after failover takes X% longer than the 4 -> 3 rebalance with 10 buckets. Is the increase around the same (~X%) with the 30-bucket test? If yes, we can close this ticket.
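The comparison proposed in Option #2 amounts to computing the percentage slowdown at each bucket count and checking whether it stays roughly constant. A minimal sketch, using the 30-bucket figures reported in this ticket (~206 min for the 4 -> 3 rebalance, ~380 min for rebalance after failover); the 10-bucket timings would need to come from an actual run:

```python
def pct_increase(baseline_min: float, slower_min: float) -> float:
    """Percentage by which the slower rebalance exceeds the baseline."""
    return (slower_min - baseline_min) / baseline_min * 100.0

# 30-bucket figures reported in this ticket.
x_30 = pct_increase(206, 380)
print(f"rebalance after failover is ~{x_30:.0f}% slower with 30 buckets")

# Per Option #2: run the same pair of tests with <= 10 buckets, compute
# the analogous x_10, and check whether x_10 is roughly equal to x_30.
# If so, the slowdown scales with bucket count as expected and the
# ticket can be closed.
```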
          poonam Poonam Dhavale made changes -
          Assignee Poonam Dhavale [ poonam ] Mahesh Mandhare [ mahesh.mandhare ]
          wayne Wayne Siu made changes -
          Summary rebalance is very slow after failover [high-bucket] - rebalance is very slow after failover
          Aliaksey Artamonau Aliaksey Artamonau (Inactive) made changes -
          Resolution Incomplete [ 4 ]
          Status Open [ 1 ] Resolved [ 5 ]

raju Raju Suravarjjala added a comment:
Bulk closing all invalid, duplicate, and won't-fix bugs. Please feel free to reopen them.
          raju Raju Suravarjjala made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

          People

Assignee: mahesh.mandhare Mahesh Mandhare (Inactive)
Reporter: mahesh.mandhare Mahesh Mandhare (Inactive)
Votes: 0
Watchers: 8


              Gerrit Reviews

                There are no open Gerrit changes
