Couchbase Server / MB-42800

Eventing: Rebalance button disabled for failed rebalance

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.6.0
    • Fix Version/s: 6.6.1
    • Component/s: eventing
    • Triage: Untriaged
    • Story Points: 1
    • Is this a Regression?: Unknown

      Description

      Build: 6.6.0 build 7917

      Test:

      • Create a 4-node cluster: 2 KV, 1 eventing, 1 N1QL:index
      • Create 10 buckets
      • Deploy 2 handlers
      • Fail over 1 KV node
      • Add back the failed-over node with full recovery
      • When the rebalance of the last bucket is about 90% complete, deploy 1 more handler
      • Rebalance fails with:

      Rebalance exited with reason {service_rebalance_failed,eventing,
      {worker_died,
      {'EXIT',<0.30052.79>,
      {{badmatch,
      {error,
      {bad_nodes,eventing,prepare_rebalance,
      [{'ns_1@172.23.106.73',
      {error,
      {unknown_error,
      <<"Some apps are deploying or resuming on some or all Eventing nodes">>}}}]}}},
      [{service_rebalancer,rebalance_worker,1,
      [{file,"src/service_rebalancer.erl"},
      {line,164}]},
      {proc_lib,init_p,3,
      [{file,"proc_lib.erl"},{line,232}]}]}}}}.
      Rebalance Operation Id = 50d80d59791945ddf21089e29de64db0 

      Observed: The Rebalance button is disabled even though the rebalance failed, so the customer cannot retry the failed rebalance.


            Activity

            vikas.chaudhary Vikas Chaudhary created issue -
            jeelan.poola Jeelan Poola made changes -
            Field Original Value New Value
            Assignee Jeelan Poola [ jeelan.poola ] Ankit Prabhu [ ankit.prabhu ]
            jeelan.poola Jeelan Poola added a comment -

            PrepareTopologyChange was called at:

            2020-11-18T00:21:52.600-08:00 [Info] ServiceMgr::PrepareTopologyChange change: service.TopologyChange{ID:"27c62fab7fb37e7f6d26b5c4fd94f8b0", CurrentTopologyRev:service.Revision(nil), Type:"topology-change-rebalance", KeepNodes:[]struct { NodeInfo service.NodeInfo "json:\"nodeInfo\""; RecoveryType service.RecoveryType "json:\"recoveryType\"" }{struct { NodeInfo service.NodeInfo "json:\"nodeInfo\""; RecoveryType service.RecoveryType "json:\"recoveryType\"" }{NodeInfo:service.NodeInfo{NodeID:"a66e18546ae77e357602fdf52726a39a", Priority:0, Opaque:interface {}(nil)}, RecoveryType:"recovery-full"}}, EjectNodes:[]service.NodeInfo{}}
            

            It found that some functions were deploying:

            2020-11-18T00:21:52.838-08:00 [Info] ServiceMgr::checkTopologyChangeReadiness Bootstrap status across all Eventing nodes: true
            2020-11-18T00:21:52.838-08:00 [Warn] ServiceMgr::checkTopologyChangeReadiness Some apps are undergoing bootstrap
            

            Eventing then returned a failure, which is by design and expected behaviour (we do not allow rebalance during lifecycle operations, and vice versa):

            2020-11-18T00:21:52.838-08:00 [Info] ServiceMgr::PrepareTopologyChange failed: Some apps are deploying or resuming on some or all Eventing nodes
            

            We now need help from ns_server/UI to understand why the rebalance button gets disabled after a legitimate rebalance failure in one of the services.

            jeelan.poola Jeelan Poola added a comment -

            Dave Finlay: It would be great if someone from ns_server could take a look at this and comment on why the rebalance button was disabled in this scenario. If eventing is doing something wrong, or not doing something that is expected, we would like to correct it. Thanks a lot in advance!

            FYI, the customer's concern is primarily about not being able to re-click the rebalance button. There are workarounds, such as triggering rebalance through the CLI or triggering an eventing-internal rebalance through REST, but they are less than desirable for the customer.

            dfinlay Dave Finlay added a comment - - edited

            Jeelan Poola: When deciding whether to enable the button in the UI, ns_server checks that the data is balanced and also asks the services whether they are balanced. It looks like eventing claims that it is balanced.

            2020-11-18T00:21:52.839-08:00, ns_orchestrator:0:critical:message(ns_1@172.23.106.64) - Rebalance exited with reason {service_rebalance_failed,eventing,
            ...
            [json_rpc:debug,2020-11-18T00:21:58.487-08:00,ns_1@172.23.106.73:json_rpc_connection-eventing-service_api<0.6924.0>:json_rpc_connection:handle_info:94]got response: [{<<"id">>,315},
                           {<<"result">>,
                            {[{<<"rev">>,<<"AAAAAAAAABc=">>},
                              {<<"nodes">>,[<<"a66e18546ae77e357602fdf52726a39a">>]},
                              {<<"isBalanced">>,true}]}},
            ...
            [json_rpc:debug,2020-11-18T01:09:52.942-08:00,ns_1@172.23.106.73:json_rpc_connection-eventing-service_api<0.6924.0>:json_rpc_connection:handle_info:94]got response: [{<<"id">>,511},
                           {<<"result">>,
                            {[{<<"rev">>,<<"AAAAAAAAABc=">>},
                              {<<"nodes">>,[<<"a66e18546ae77e357602fdf52726a39a">>]},
                              {<<"isBalanced">>,true}]}},
                           {<<"error">>,null}]
            
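A minimal sketch of the button decision described in this comment. It is not ns_server code: `ServiceStatus` and `rebalanceButtonEnabled` are hypothetical names, and it assumes the data was already balanced once the KV rebalance finished. Only the `isBalanced` field comes from the logs above.

```go
package main

import "fmt"

// ServiceStatus is a hypothetical, simplified view of what a service reports
// to the cluster manager. Only IsBalanced mirrors a field seen in the logs.
type ServiceStatus struct {
	Name       string
	IsBalanced bool
}

// rebalanceButtonEnabled is an assumed model of the UI decision: the button is
// offered when the data is unbalanced or any service reports it is unbalanced.
func rebalanceButtonEnabled(dataBalanced bool, services []ServiceStatus) bool {
	if !dataBalanced {
		return true
	}
	for _, s := range services {
		if !s.IsBalanced {
			return true
		}
	}
	return false
}

func main() {
	// Scenario from this ticket: the eventing rebalance failed, but eventing
	// still reports isBalanced=true, so the button stays disabled.
	services := []ServiceStatus{{Name: "eventing", IsBalanced: true}}
	fmt.Println("button enabled:", rebalanceButtonEnabled(true, services))

	// If eventing reported isBalanced=false instead, the button would be offered.
	services[0].IsBalanced = false
	fmt.Println("button enabled:", rebalanceButtonEnabled(true, services))
}
```

Under this model, a service that reports balanced after a failed rebalance is enough to hide the retry option, which matches the observed behaviour.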

            jeelan.poola Jeelan Poola made changes -
            Link This issue blocks CBSE-9243 [ CBSE-9243 ]
            jeelan.poola Jeelan Poola added a comment -

            Thank you, Dave Finlay! It really helped.
            Ritam Sharma, Vikas Chaudhary: We have a patch with a simple fix (eventing must report isBalanced=false when PrepareTopologyChange() fails). The fix has been verified by Vikas on Toy http://server.jenkins.couchbase.com/view/Toys/job/toy-unix-simple/1758/. Requesting inclusion in 6.6.1. Thank you!
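A rough sketch of the idea behind the fix, not the actual eventing patch. The `serviceMgr` type, its fields, and the `appsDeploying` flag are hypothetical; only `PrepareTopologyChange`, `isBalanced`, and the error text appear in this ticket. The flag is written and read under a lock, in the spirit of the later "read isBalanced holding the mu lock" commit.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// serviceMgr is a hypothetical stand-in for the Eventing service manager.
type serviceMgr struct {
	mu            sync.RWMutex
	isBalanced    bool
	appsDeploying bool // assumed flag: some handler is deploying or resuming
}

// Error text copied from the log message quoted in this ticket.
var errAppsDeploying = errors.New("Some apps are deploying or resuming on some or all Eventing nodes")

// PrepareTopologyChange rejects the rebalance while lifecycle operations are
// in flight and, as the fix requires, records that the service is unbalanced.
func (m *serviceMgr) PrepareTopologyChange() error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.appsDeploying {
		m.isBalanced = false
		return errAppsDeploying
	}
	return nil
}

// IsBalanced reads the flag while holding the lock, so a concurrent
// PrepareTopologyChange cannot race with the read.
func (m *serviceMgr) IsBalanced() bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.isBalanced
}

func main() {
	m := &serviceMgr{isBalanced: true, appsDeploying: true}
	err := m.PrepareTopologyChange()
	fmt.Println("prepare failed:", err != nil, "isBalanced:", m.IsBalanced())
}
```

With the flag cleared on failure, the next topology query would report the service as unbalanced, which is what lets the UI offer the rebalance button again.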

            ritam.sharma Ritam Sharma added a comment -

            Jeelan Poola - Thank you, we should include this for 6.6.1

            jeelan.poola Jeelan Poola made changes -
            Link This issue blocks MB-40528 [ MB-40528 ]
            jeelan.poola Jeelan Poola made changes -
            Link This issue is cloned by MB-42830 [ MB-42830 ]
            jeelan.poola Jeelan Poola made changes -
            Link This issue is a backport of MB-42830 [ MB-42830 ]
            jeelan.poola Jeelan Poola made changes -
            Link This issue is cloned by MB-42830 [ MB-42830 ]
            wayne Wayne Siu made changes -
            Labels approved-for-6.6.1
            jeelan.poola Jeelan Poola made changes -
            Assignee Ankit Prabhu [ ankit.prabhu ] Jeelan Poola [ jeelan.poola ]
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.6.1-9196 contains eventing commit a61032d with commit message:
            MB-42800: Fix CI script

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.6.1-9197 contains eventing commit 27d2b48 with commit message:
            MB-42800 : Report correct isBalanced state on rebalance failure

            jeelan.poola Jeelan Poola made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.6.1-9198 contains eventing commit f4b5cd7 with commit message:
            MB-42800 : read isBalanced holding the mu lock

            vikas.chaudhary Vikas Chaudhary added a comment -

            verified on 6.6.1-9198

            vikas.chaudhary Vikas Chaudhary made changes -
            Status Resolved [ 5 ] Closed [ 6 ]

              People

              Assignee:
              jeelan.poola Jeelan Poola
              Reporter:
              vikas.chaudhary Vikas Chaudhary
              Votes:
              0
              Watchers:
              5
