Couchbase Kubernetes / K8S-543

Rebalance status loop exits early and reports rebalance as incomplete


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: operator
    • Labels: None

    Description

      TestResizeClusterWithBucket

      If a rebalance is attempted and the first attempt fails, it is retried. Once the rebalance eventually completes successfully, only a rebalance incomplete event is seen. It would make more sense to report rebalance complete in that scenario and only report rebalance incomplete when all retries have been exhausted. Otherwise, tests will fail like this:

      Expected events to be:
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-mjwqv-0000 added to cluster
          		Type: Normal | Reason: BucketCreated | Message: A new bucket `default` was created
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-mjwqv-0001 added to cluster
          		Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
          		Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-mjwqv-0002 added to cluster
          		Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
          		Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
          		Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
          		Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-mjwqv-0002 removed from the cluster
          		Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
          		Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
          		Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-mjwqv-0001 removed from the cluster
          		Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
          		
          		but got:
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-mjwqv-0000 added to cluster
          		Type: Normal | Reason: BucketCreated | Message: A new bucket `default` was created
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-mjwqv-0001 added to cluster
          		Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
          		Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-mjwqv-0002 added to cluster
          		Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
          		Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
          		Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
          		Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-mjwqv-0002 removed from the cluster
          		Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
          		Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
          		Type: Normal | Reason: RebalanceIncomplete | Message: A rebalance is incomplete
          		Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-mjwqv-0001 removed from the cluster
      

      This looks like the rebalance failed, but the cluster status before the events are checked is "balanced", "healthy", and "ready".
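
      As a rough illustration of the desired semantics, here is a minimal Go sketch; the Cluster type and the startRebalance and emitEvent helpers are hypothetical stand-ins, not the operator's actual API:

          package example

          import "errors"

          // Cluster is a hypothetical stand-in for the operator's cluster handle.
          type Cluster struct{}

          // startRebalance would kick off a rebalance via ns_server and block
          // until it finishes or fails; stubbed out here.
          func (c *Cluster) startRebalance() error { return nil }

          // emitEvent would record a Kubernetes event; stubbed out here.
          func (c *Cluster) emitEvent(reason, msg string) {}

          // rebalanceWithRetries reports RebalanceCompleted as soon as any attempt
          // succeeds, and RebalanceIncomplete only after every retry is exhausted.
          func rebalanceWithRetries(c *Cluster, maxRetries int) error {
              for attempt := 0; attempt < maxRetries; attempt++ {
                  if err := c.startRebalance(); err != nil {
                      continue // failed attempt: retry, do not report incomplete yet
                  }
                  c.emitEvent("RebalanceCompleted", "A rebalance has completed")
                  return nil
              }
              c.emitEvent("RebalanceIncomplete", "A rebalance is incomplete")
              return errors.New("cluster reports rebalance incomplete")
          }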


        Activity

          Mike Wiederhold added a comment:

          From looking at the logs, this issue is caused by the rebalance status loop exiting early due to an error returned by the server. In this case, the logs show the following message while the rebalance status loop is running:

          2018/08/15 23:59:06 Unsolicited response received on idle HTTP channel starting with "HTTP/1.1 400 Bad Request\r\nServer: MochiWeb/1.0 (Any of you quaids got a smint?)\r\nDate: Wed, 15 Aug 2018 23:59:06 GMT\r\nContent-Length: 0\r\n\r\n"; err=<nil>

          The loop exits, and when we then check whether a rebalance is still needed, ns_server reports that it is because the rebalance is still in progress.

          time="2018-08-15T23:59:10Z" level=error msg="failed to reconcile: Failed to rebalance: cluster reports rebalance incomplete" cluster-name=test-couchbase-mjwqv module=cluster

          The problem is that we need better error checking in the rebalance status loop. If the call to check the status fails, we need to check another node. Even if all nodes fail, we should still retry for a certain amount of time (maybe 60 seconds) before giving up. Below is a link to the code that needs to be improved.

          https://github.com/couchbase/gocbmgr/blob/master/api.go#L132

          It should also be noted that we may not be able to check the status for the full 60 seconds; in that case, we should skip raising an event, or raise a rebalance status unknown event.
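
          A minimal sketch of that hardened loop might look like the following; all names here are hypothetical illustrations, not gocbmgr's actual code:

              package example

              import (
                  "errors"
                  "time"
              )

              // rebalanceStatus stands in for the per-node REST call gocbmgr
              // makes; it returns true once the rebalance has finished.
              func rebalanceStatus(host string) (done bool, err error) { return false, nil }

              // waitForRebalance polls every node in turn instead of aborting on
              // the first transient HTTP error, and only gives up once the time
              // budget (e.g. 60 seconds) is spent.
              func waitForRebalance(hosts []string, timeout time.Duration) (bool, error) {
                  deadline := time.Now().Add(timeout)
                  for time.Now().Before(deadline) {
                      for _, h := range hosts {
                          done, err := rebalanceStatus(h)
                          if err != nil {
                              continue // e.g. a stray 400 on an idle channel: try the next node
                          }
                          if done {
                              return true, nil
                          }
                          break // got a valid "still running" answer; sleep and poll again
                      }
                      time.Sleep(time.Second)
                  }
                  // No node gave a usable answer (or the rebalance never finished)
                  // within the window: surface "status unknown" rather than "incomplete".
                  return false, errors.New("rebalance status unknown")
              }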


          Mike Wiederhold added a comment:

          In order to get more information on this issue, I've added a change that adds debug logging to gocbmgr. Once this change is merged, we should be able to re-run the tests and then put a proper fix in for this issue.

          http://review.couchbase.org/#/c/98485/


          Korrigan Clark added a comment:

          In the latest run we still see the "rebalance is incomplete" event. This by itself is not an issue; the issue is that there is no terminating "rebalance completed" event for this particular rebalance. Thus it is impossible to wait for rebalance completion, as it is never reported.

          Search the console logs for "rebalance is incomplete": http://qa.sc.couchbase.com/view/Cloud/job/trigger-kubernetes-P0/461/consoleFull

          Artifacts can be found here: http://qa.sc.couchbase.com/view/Cloud/job/trigger-kubernetes-P0/461/


          Korrigan Clark added a comment:

          Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
              		Type: Normal | Reason: RebalanceIncomplete | Message: A rebalance is incomplete
              		Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-dg27k-0000 removed from the cluster
              		Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-dg27k-0001 removed from the cluster
              		Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
              		Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed

          From these events you can see that two rebalances are started, but only the last one is reported as complete.


          Mike Wiederhold added a comment:

          The behavior reported is expected. The rebalance did fail, and as a result we exited the reconcile loop. Once the reconcile loop runs again, nodes 0001 and 0000 are no longer part of the cluster because they were failed over, and they get removed. We then see that the cluster is still not balanced and start the rebalance again.
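
          In other words, the reconcile flow behaves roughly like this sketch; removeDownNodes, clusterBalanced, and rebalance are hypothetical stand-ins for the operator's internals:

              package example

              import "errors"

              // Hypothetical cluster handle and helpers, for illustration only.
              type cluster struct{}

              func (c *cluster) removeDownNodes()      {} // drop members that were failed over
              func (c *cluster) clusterBalanced() bool { return false }
              func (c *cluster) rebalance() error      { return errors.New("rebalance incomplete") }

              // reconcile shows why two RebalanceStarted events appear: a failed
              // rebalance aborts one pass, and the next pass first removes the
              // failed-over nodes and then starts a fresh rebalance.
              func reconcile(c *cluster) error {
                  c.removeDownNodes()
                  if !c.clusterBalanced() {
                      if err := c.rebalance(); err != nil {
                          return err // ends this pass; the next pass retries
                      }
                  }
                  return nil
              }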


          People

            Assignee: Mike Wiederhold
            Reporter: Korrigan Clark
            Votes: 0
            Watchers: 1

