Couchbase Kubernetes / K8S-479

Resize nodes: Rebalance failed according to couchbase-server, but operator event says rebalance succeeded


Details

    Description

      Scenario:

      1. Deploy a cluster with 2 nodes running the data, index, and query services and 3 nodes running the eventing service
      2. Enable an eventing function with 3 buckets in the cluster
      3. Run a goroutine in the background to insert data into the source bucket
      4. Initiate a resize of the eventing nodes to 4 nodes

      After the extra eventing node was added, a rebalance started and then failed with the reason:

      Rebalance exited with reason {service_rebalance_failed,eventing, {badmatch, {error, {unknown_error, <<"Some apps are undergoing bootstrap">>}}}}

      But the operator generated the event "rebalance completed".

       

      Testcase name: TestEventingResizeCluster


          Activity

            simon.murray Simon Murray added a comment - - edited

            As far as the operator is concerned, a rebalance is a success if all the nodes are added or ejected as expected.  The /pools/default/tasks API (which we use for progress) doesn't return any status.  /diag/masterEvents should allow us to get the actual state out, but I'd hazard a guess this could get racy.
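            The gap described above can be illustrated with a minimal sketch, assuming /pools/default/tasks returns a JSON array of task objects with "type" and "status" fields, and an "errorMessage" only in some failure cases; the exact field names should be verified against the server version:

```python
# Hedged sketch: interpret a /pools/default/tasks payload. The point is
# that once the rebalance task is no longer "running", the payload alone
# does not reliably distinguish a successful rebalance from a failed one.
def rebalance_state(tasks):
    for task in tasks:
        if task.get("type") != "rebalance":
            continue
        if task.get("status") == "running":
            return "running"
        # Some failures surface an errorMessage, but not all do.
        if task.get("errorMessage"):
            return "failed"
        return "idle"  # finished, but success vs. failure is unknown
    return "idle"
```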

            Logically the cluster will require a rebalance next time around and it will sort itself out, so working as designed.

            I guess we could poll the eventing service and check its status before allowing a reconcile.  The problem is that if eventing is broken in some way, we'll never reconcile...

            Relevant issues:

            https://issues.couchbase.com/browse/MB-26859

            simon.murray Simon Murray added a comment -

            Rebalance fails while a function is being deployed.  Deployment begins with an entry being added to /api/v1/stats; only once the function is deployed do fields such as $[].execution_stats get populated.

            Plan of attack: before reconciling, collect all live nodes that have eventing enabled, poll the API for deployed functions that are still bootstrapping, and abort if any are found.
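            The plan above could be sketched roughly as follows, assuming /api/v1/stats returns a JSON array with one object per deployed function and that execution_stats is absent until bootstrap completes (per the observation above); the function and field names here are illustrative:

```python
def bootstrapping_functions(stats):
    # A function whose stats entry lacks "execution_stats" is taken to
    # be still bootstrapping, per the observation above.
    return [f.get("function_name", "<unknown>")
            for f in stats
            if "execution_stats" not in f]

def safe_to_reconcile(stats_by_node):
    # stats_by_node: eventing node name -> its /api/v1/stats payload.
    # Abort (return False) if any node reports a bootstrapping function.
    return all(not bootstrapping_functions(s) for s in stats_by_node.values())
```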

            lynn.straus Lynn Straus added a comment - - edited

            Per review in the July 24 K8s meeting, confirmed this is not a must-fix for 1.0.  Simon Murray to defer to an appropriate future version.  Thanks!

             

            Per the July 25 bug review, after further review, there are some test-related issues being addressed, so this will remain in 1.0.

            simon.murray Simon Murray added a comment -

            I've added a catch-all in the rebalance code; from what I can see, each time the issue reared its head it was reported as failed due to nodes not being correctly balanced.  Keeping this issue open for tracking the test fixes.


            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            The operator is now catching the rebalance failure from couchbase-cluster, so closing the ticket.
            ashwin.govindarajulu Ashwin Govindarajulu added a comment - - edited

            Now the rebalance is incomplete, but it is reported as successful by couchbase-cluster.

            Testcase: TestAnalyticsKillPods

            Attaching the respective logs along with this.

            Events:
              Type     Reason               Age   From                                 Message
              ----     ------               ----  ----                                 -------
              Normal   ServiceCreated       11m   couchbase-operator-5bdf548959-smvph  Service for admin console `test-couchbase-jsstw-ui` was created
              Normal   NewMemberAdded       10m   couchbase-operator-5bdf548959-smvph  New member test-couchbase-jsstw-0000 added to cluster
              Normal   NewMemberAdded       10m   couchbase-operator-5bdf548959-smvph  New member test-couchbase-jsstw-0001 added to cluster
              Normal   NewMemberAdded       9m    couchbase-operator-5bdf548959-smvph  New member test-couchbase-jsstw-0002 added to cluster
              Normal   NewMemberAdded       8m    couchbase-operator-5bdf548959-smvph  New member test-couchbase-jsstw-0003 added to cluster
              Normal   NewMemberAdded       7m    couchbase-operator-5bdf548959-smvph  New member test-couchbase-jsstw-0004 added to cluster
              Normal   NewMemberAdded       7m    couchbase-operator-5bdf548959-smvph  New member test-couchbase-jsstw-0005 added to cluster
              Normal   NewMemberAdded       6m    couchbase-operator-5bdf548959-smvph  New member test-couchbase-jsstw-0006 added to cluster
              Normal   NodeServiceCreated   6m    couchbase-operator-5bdf548959-smvph  Node service for analytics was created
              Normal   NodeServiceCreated   6m    couchbase-operator-5bdf548959-smvph  Node service for data was created
              Normal   NodeServiceCreated   6m    couchbase-operator-5bdf548959-smvph  Node service for eventing was created
              Normal   NodeServiceCreated   6m    couchbase-operator-5bdf548959-smvph  Node service for index was created
              Normal   NodeServiceCreated   6m    couchbase-operator-5bdf548959-smvph  Node service for query was created
              Normal   NodeServiceCreated   6m    couchbase-operator-5bdf548959-smvph  Node service for search was created
              Normal   RebalanceStarted     6m    couchbase-operator-5bdf548959-smvph  A rebalance has been started to balance data across the cluster
              Normal   RebalanceCompleted   6m    couchbase-operator-5bdf548959-smvph  A rebalance has completed
              Normal   BucketCreated        5m    couchbase-operator-5bdf548959-smvph  A new bucket `defBucket` was created
              Warning  MemberDown           4m    couchbase-operator-5bdf548959-smvph  Existing member test-couchbase-jsstw-0004 down
              Warning  MemberFailedOver     3m    couchbase-operator-5bdf548959-smvph  Existing member test-couchbase-jsstw-0004 failed over
              Normal   NewMemberAdded       3m    couchbase-operator-5bdf548959-smvph  New member test-couchbase-jsstw-0007 added to cluster
              Normal   RebalanceStarted     3m    couchbase-operator-5bdf548959-smvph  A rebalance has been started to balance data across the cluster
              Normal   RebalanceIncomplete  2m    couchbase-operator-5bdf548959-smvph  A rebalance is incomplete
              Normal   MemberRemoved        2m    couchbase-operator-5bdf548959-smvph  Existing member test-couchbase-jsstw-0004 removed from the cluster

             

            simon.murray Simon Murray added a comment -

            If that is the case, this is a race condition with ns_server.  The only thing I can suggest is adding a retry loop to triple-check.
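            The suggested triple check might look like the bounded retry below; check(), the attempt count, and the delay are all assumptions, not the operator's actual implementation:

```python
import time

def confirm_balanced(check, attempts=3, delay=1.0, sleep=time.sleep):
    # Re-run the status check a fixed number of times before trusting
    # a negative result, to ride out the suspected race with ns_server.
    for i in range(attempts):
        if check():
            return True
        if i < attempts - 1:
            sleep(delay)
    return False
```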

            simon.murray Simon Murray added a comment -

            Okay, that is not the problem:

            Rebalance exited with reason {service_rebalance_failed,cbas, {lost_connection,shutdown}}

            I suggest you raise a bug against CBAS.


            mikew Mike Wiederhold [X] (Inactive) added a comment -

            Simon,

            Where did you see that CBAS message? I'm not seeing it in any of the logs.

            • Mike
            simon.murray Simon Murray added a comment -

            I think it was raised in the UI logs.


            mikew Mike Wiederhold [X] (Inactive) added a comment -

            There are two different problems that were discussed on this issue. The problem in July was caused by an eventing issue, and the operator did the right thing. The problem in August is a duplicate of K8S-543.

            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Closing this bug.

            Verified this scenario using server build Enterprise Edition 6.0.0 build 1550 and operator 1.0.0-418.

            People

              simon.murray Simon Murray
              ashwin.govindarajulu Ashwin Govindarajulu