Couchbase Server / MB-32036

[System test]: Eventing rebalance failed because of timeout


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version: 5.5.3
    • Fix Version: 5.5.3
    • Component: eventing
    • Environment: centos2

    Description

      Build: 5.5.3-4029 

      Test: Centos longevity 

      Cycle: 1

      Indexer rebalance-in fails with an Eventing timeout:

      [user:error,2018-11-14T21:40:10.888-08:00,ns_1@172.23.96.206:<0.11669.0>:ns_orchestrator:do_log_rebalance_completion:1117]Rebalance exited with reason {service_rebalance_failed,eventing,
                                    {rebalance_failed,
                                     {service_error,
                                      <<"eventing rebalance hasn't made progress for past 3600 secs">>}}} 

      Observed that undeployment also fails with ERR_REBALANCE_ONGOING, even when it is triggered after all previous rebalances have finished.
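      A minimal sketch (not from this ticket) of how a test driver could guard against ERR_REBALANCE_ONGOING: poll the ns_server /pools/default/tasks endpoint until no rebalance task is reported as running before issuing the undeploy. The host, credentials and poll interval below are placeholders.

      // Sketch only: wait until ns_server reports no running rebalance before
      // triggering an Eventing undeploy, to avoid ERR_REBALANCE_ONGOING.
      // Host, credentials and poll interval are placeholders.
      package main

      import (
          "encoding/json"
          "fmt"
          "net/http"
          "time"
      )

      type task struct {
          Type   string `json:"type"`
          Status string `json:"status"`
      }

      // rebalanceRunning reports whether /pools/default/tasks lists a
      // rebalance task whose status is "running".
      func rebalanceRunning(base, user, pass string) (bool, error) {
          req, err := http.NewRequest("GET", base+"/pools/default/tasks", nil)
          if err != nil {
              return false, err
          }
          req.SetBasicAuth(user, pass)
          resp, err := http.DefaultClient.Do(req)
          if err != nil {
              return false, err
          }
          defer resp.Body.Close()

          var tasks []task
          if err := json.NewDecoder(resp.Body).Decode(&tasks); err != nil {
              return false, err
          }
          for _, t := range tasks {
              if t.Type == "rebalance" && t.Status == "running" {
                  return true, nil
              }
          }
          return false, nil
      }

      func main() {
          base, user, pass := "http://172.23.96.206:8091", "Administrator", "password"
          for {
              running, err := rebalanceRunning(base, user, pass)
              if err != nil {
                  fmt.Println("poll error:", err)
              } else if !running {
                  break // no rebalance task running; undeploy can be issued now
              }
              time.Sleep(10 * time.Second)
          }
          fmt.Println("no rebalance in progress; safe to trigger undeployment")
      }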


        Activity

          wayne Wayne Siu added a comment -

          Vikas Chaudhary Abhishek Singh

          Is this a regression or a new issue?  Thanks.


          vikas.chaudhary Vikas Chaudhary added a comment -

          Wayne Siu The error is very generic; Abhishek Singh will be able to confirm whether this is a regression.

          asingh Abhishek Singh (Inactive) added a comment -

          2018-11-14T20:46:43.685-08:00 [Info] rebalancer::gatherProgress total vbs to shuffle: 328 remaining to shuffle: 328 progress: 0 counter: 131 cmp: true
          2018-11-14T20:46:46.805-08:00 [Info] rebalancer::gatherProgress total vbs to shuffle: 329 remaining to shuffle: 329 progress: 0 counter: 132 cmp: true
          2018-11-14T20:46:49.656-08:00 [Info] rebalancer::gatherProgress total vbs to shuffle: 333 remaining to shuffle: 333 progress: 0 counter: 133 cmp: true
          2018-11-14T20:46:52.890-08:00 [Info] rebalancer::gatherProgress total vbs to shuffle: 353 remaining to shuffle: 353 progress: 0 counter: 134 cmp: true
          2018-11-14T20:46:55.692-08:00 [Info] rebalancer::gatherProgress total vbs to shuffle: 353 remaining to shuffle: 353 progress: 0 counter: 135 cmp: true
          

          This looks to be the case where Eventing rebalance failed because the backlog of events to process is ever growing. As a result, the number of vbucket streams to be taken care of as part of rebalance keeps growing gradually (because KV would disconnect the Eventing-related connection if a NOOP isn't acked within 6 minutes, IIRC). There are two options (given we can't backport the rebalance changes from 6.0 to 5.5 because of the breadth of changes needed):

          • Add more Eventing nodes to the cluster, increasing the count from 2 nodes to 4 nodes.
          • Lower the ops on the source bucket that Eventing is listening to.

          This behavior has been completely revamped in 6.0, where Eventing rebalance is no longer a function of the backlog of events to process.
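          A minimal sketch (illustrative only, not the actual Eventing source) of how the gatherProgress figures above translate into the reported progress: while the remaining vbucket count keeps pace with the ever-growing total, the computed progress stays at 0, so the 3600-second no-progress check eventually fires.

          // Sketch only: illustrates why progress stays at 0 in the log above
          // when the backlog keeps growing. Names and formula are illustrative,
          // not the real Eventing code.
          package main

          import "fmt"

          // progress returns the fraction of vbucket moves completed so far.
          func progress(totalVbsToShuffle, remainingToShuffle int) float64 {
              if totalVbsToShuffle == 0 {
                  return 1.0
              }
              return float64(totalVbsToShuffle-remainingToShuffle) / float64(totalVbsToShuffle)
          }

          func main() {
              // Values taken from the gatherProgress log lines above: remaining
              // never drops below total, so every sample reports progress 0 and
              // the "hasn't made progress for past 3600 secs" check trips.
              samples := [][2]int{{328, 328}, {329, 329}, {333, 333}, {353, 353}, {353, 353}}
              for i, s := range samples {
                  fmt.Printf("sample %d: total=%d remaining=%d progress=%.2f\n",
                      i+1, s[0], s[1], progress(s[0], s[1]))
              }
          }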


          vikas.chaudhary Vikas Chaudhary added a comment -

          Any change will change the test itself, so we would be running a different test from the one used for 5.5.0. We can rerun the test on the Centos1 cluster where it initially ran on 5.5.0. Changing the test may uncover other issues.

          CC: Ritam Sharma Mihir Kamdar

          ritam.sharma Ritam Sharma added a comment -

          Vikas Chaudhary - Please stick to the same test case and same hardware.

          vikas.chaudhary Vikas Chaudhary added a comment -

          Restarted the test on centos1: http://qa.sc.couchbase.com/job/centos-systest-launcher/1648/console

          jeelan.poola Jeelan Poola added a comment -

          The Eventing rebalance design has completely changed (totally decoupled from backlog size) in 6.0. Hence this particular issue can be considered already fixed in 6.0, and the backport is non-trivial. We have also not seen this issue during 5.5.1/5.5.2.

          A possible workaround for cases where a customer may face this issue would be to undeploy functions before rebalance. So we would like to defer this MB to 6.0.1 unless it is seen very frequently in 5.5.3.
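          A hedged sketch of that workaround: undeploy a function before rebalance by setting its deployment and processing flags to false through the Eventing REST service on port 8096, then rebalance, then redeploy. The /api/v1/functions/<name>/settings endpoint shown is the one documented for later releases and is assumed to apply here; host, credentials and the function name are placeholders.

          // Hedged sketch of the "undeploy before rebalance" workaround.
          // The settings endpoint is the one documented for later Couchbase
          // releases and is assumed here; host, credentials and function name
          // are placeholders.
          package main

          import (
              "bytes"
              "fmt"
              "net/http"
          )

          // undeployFunction asks Eventing to undeploy fn by turning off both
          // the deployment and processing flags.
          func undeployFunction(eventingBase, user, pass, fn string) error {
              body := []byte(`{"deployment_status": false, "processing_status": false}`)
              url := fmt.Sprintf("%s/api/v1/functions/%s/settings", eventingBase, fn)
              req, err := http.NewRequest("POST", url, bytes.NewReader(body))
              if err != nil {
                  return err
              }
              req.SetBasicAuth(user, pass)
              req.Header.Set("Content-Type", "application/json")
              resp, err := http.DefaultClient.Do(req)
              if err != nil {
                  return err
              }
              defer resp.Body.Close()
              if resp.StatusCode != http.StatusOK {
                  return fmt.Errorf("undeploy request failed: %s", resp.Status)
              }
              return nil
          }

          func main() {
              // Undeploy before starting the rebalance; redeploy afterwards.
              if err := undeployFunction("http://172.23.96.206:8096", "Administrator", "password", "my_function"); err != nil {
                  fmt.Println("undeploy failed:", err)
              }
          }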


          vikas.chaudhary Vikas Chaudhary added a comment -

          Not seen on the latest run on the centos1 cluster.

          People

            Assignee: vikas.chaudhary Vikas Chaudhary
            Reporter: vikas.chaudhary Vikas Chaudhary
            Votes: 0
            Watchers: 6

