Couchbase Kubernetes / K8S-113

Unable to reconcile when service dies while scaling down


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: 1.0.0
    • Affects Version: 0.7.0
    • Component: operator

    Description

We may need to modify the order of operations when dealing with a rebalance failure on scale down, prioritizing the removal of any down nodes first.

Currently, a failure while scaling down causes the node being removed to become a 'dead member', so the operator removes it. For example, here the rebalance fails at 0% and the member is then removed:

      time="2018-01-11T23:26:20Z" level=info msg="Rebalance progress: 0.000000" cluster-name=test-couchbase-6q66p module=retryutil
       
      time="2018-01-11T23:26:30Z" level=info msg="removing dead member \"test-couchbase-6q66p-0004\"" cluster-name=test-couchbase-6q66p module=cluster
       
      time="2018-01-11T23:26:30Z" level=info msg="deleted pod (test-couchbase-6q66p-0004)" cluster-name=test-couchbase-6q66p module=cluster
       
      time="2018-01-11T23:26:30Z" level=info msg="removed member (test-couchbase-6q66p-0004)" cluster-name=test-couchbase-6q66p module=cluster

       

At this point the pod has been removed, but the node is still part of the cluster, so the operator tracks it as an unmanaged node while the down node still needs to be removed:

      time="2018-01-11T23:26:38Z" level=info msg="running members: test-couchbase-6q66p-0003,test-couchbase-6q66p-0000,test-couchbase-6q66p-0001,test-couchbase-6q66p-0002" cluster-name=test-couchbase-6q66p module=cluster
       
      time="2018-01-11T23:26:38Z" level=info msg="cluster membership: test-couchbase-6q66p-0000,test-couchbase-6q66p-0001,test-couchbase-6q66p-0002,test-couchbase-6q66p-0003" cluster-name=test-couchbase-6q66p module=cluster
       
      time="2018-01-11T23:26:38Z" level=info msg="down nodes: test-couchbase-6q66p-0000" cluster-name=test-couchbase-6q66p module=cluster
       
      time="2018-01-11T23:26:38Z" level=info msg="unmanaged nodes: [test-couchbase-6q66p-0004.test-couchbase-6q66p.default.svc:8091]" cluster-name=test-couchbase-6q66p module=cluster

       

From this point on, the operator reports that the cluster is still rebalancing, which blocks any further reconciliation:

time="2018-01-11T23:32:24Z" level=info msg="Skipping reconcile loop because the cluster is currently rebalancing" cluster-name=test-couchbase-6q66p module=cluster

However, the cluster isn't actually rebalancing, which may be a server issue (TBD). We can probably avoid getting into this state altogether by taking different actions when the first rebalance fails.
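A minimal sketch of the proposed ordering, using placeholder method names (DownNodes, FailoverAndEject, MarkForRemoval, Rebalance) rather than the operator's actual API:

package main

import (
	"fmt"
	"log"
)

// CouchbaseCluster is a placeholder for the operator's cluster client; the
// method names below are illustrative, not the operator's real API.
type CouchbaseCluster interface {
	DownNodes() ([]string, error)       // nodes the server reports as down
	FailoverAndEject(node string) error // fail over and eject an already-down node
	MarkForRemoval(node string) error   // flag a healthy node for ejection
	Rebalance() error                   // run a rebalance and wait for it
}

// scaleDown removes any down nodes first, then rebalances out the member
// chosen for scale down. If that rebalance fails, the member (and its pod)
// is left in place for the next reconcile loop instead of being deleted as
// a "dead member".
func scaleDown(c CouchbaseCluster, member string) error {
	down, err := c.DownNodes()
	if err != nil {
		return err
	}
	for _, node := range down {
		if err := c.FailoverAndEject(node); err != nil {
			return fmt.Errorf("removing down node %s: %w", node, err)
		}
	}

	if err := c.MarkForRemoval(member); err != nil {
		return err
	}
	if err := c.Rebalance(); err != nil {
		// Keep the pod and the cluster member; retry on the next reconcile.
		return fmt.Errorf("rebalance failed, keeping member %s: %w", member, err)
	}
	log.Printf("removed member %s", member)
	return nil
}

func main() {}

The key change is that a failed rebalance no longer promotes the member to a "dead member", so its pod is never deleted and the operator never ends up tracking it as an unmanaged node while believing a rebalance is still running.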

       

       


People

Assignee: Simon Murray
Reporter: Tommie McAfee (Inactive)
Votes: 0
Watchers: 2

