Couchbase Kubernetes / K8S-598

Cluster is in unbalanced state even after RebalanceCompleted event

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Cannot Reproduce
    • 1.1.0
    • 1.1.0
    • operator

    Description

      Testcase: TestDiskFailureAutoFailover

      Rebalance completed event log:

      - count: 1
        eventTime: null
        firstTimestamp: 2018-09-25T18:00:31Z
        involvedObject:
          apiVersion: couchbase.com/v1
          kind: CouchbaseCluster
          name: test-couchbase-dt79w
          namespace: default
          resourceVersion: "31616"
          uid: 81d6f196-c0eb-11e8-8d55-8670c6b8da5c
        lastTimestamp: 2018-09-25T18:00:31Z
        message: A rebalance has completed
        metadata:
          creationTimestamp: 2018-09-25T18:00:31Z
          generateName: test-couchbase-dt79w-
          name: test-couchbase-dt79w-9mxwj
          namespace: default
          resourceVersion: "31844"
          selfLink: /api/v1/namespaces/default/events/test-couchbase-dt79w-9mxwj
          uid: e46c39f5-c0ec-11e8-8d55-8670c6b8da5c
        reason: RebalanceCompleted
        reportingComponent: ""
        reportingInstance: ""
        source:
          component: couchbase-operator-6c7bf9fd66-xw72x
        type: Normal

      Cluster status:

      status:
        buckets:
          testBucket:
            conflictResolution: seqno
            enableFlush: true
            evictionPolicy: fullEviction
            ioPriority: high
            memoryQuota: 100
            name: testBucket
            replicas: 2
            type: couchbase
        clusterId: 1a7599024e1f483172d9f71b453451e3
        conditions:
          Available:
            lastTransitionTime: 2018-09-25T17:51:17Z
            lastUpdateTime: 2018-09-25T17:51:17Z
            reason: Cluster available
            status: "True"
          Balanced:
            lastTransitionTime: 2018-09-25T17:59:00Z
            lastUpdateTime: 2018-09-25T17:59:00Z
            message: The operator is attempting to rebalance the data to correct this issue
            reason: Cluster is unbalanced
            status: "False"
          Scaling:
            lastTransitionTime: 2018-09-25T17:58:15Z
            lastUpdateTime: 2018-09-25T17:58:15Z
            message: 'Current cluster size: 5, desired cluster size: 6'
            reason: Scaling up
            status: "True"
        controlPaused: false
        currentVersion: enterprise-5.5.1
        members:
          index: 7
          ready:
          - Name: test-couchbase-dt79w-0001
          - Name: test-couchbase-dt79w-0002
          - Name: test-couchbase-dt79w-0003
          - Name: test-couchbase-dt79w-0004
          - Name: test-couchbase-dt79w-0005
          unready:
          - Name: test-couchbase-dt79w-0006
        phase: Running
        reason: ""
        size: 6
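
      The mismatch reported here can be checked mechanically: the Balanced condition was last updated at 17:59:00Z and is still "False", while the RebalanceCompleted event is stamped 18:00:31Z. A minimal sketch, assuming the status and event have already been loaded into plain Python dicts (the helper names `parse_utc` and `stale_unbalanced` are hypothetical and not part of the operator or the test framework):

```python
from datetime import datetime, timezone


def parse_utc(ts):
    """Parse a Kubernetes-style UTC timestamp such as 2018-09-25T18:00:31Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_unbalanced(conditions, rebalance_completed_at):
    """Return True if Balanced is still "False" and was last updated
    before the given RebalanceCompleted event timestamp."""
    balanced = conditions.get("Balanced")
    if balanced is None or balanced.get("status") != "False":
        return False
    return parse_utc(balanced["lastUpdateTime"]) < parse_utc(rebalance_completed_at)


# Values copied from the cluster status and event log in this ticket.
conditions = {
    "Balanced": {
        "lastTransitionTime": "2018-09-25T17:59:00Z",
        "lastUpdateTime": "2018-09-25T17:59:00Z",
        "reason": "Cluster is unbalanced",
        "status": "False",
    }
}
is_stale = stale_unbalanced(conditions, "2018-09-25T18:00:31Z")
print(is_stale)
```

      This only compares the recorded timestamps; it says nothing about why the condition was not cleared.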

        Activity

          simon.murray Simon Murray added a comment -

          The operator logs show no errors or warnings, in particular none when updating the CR status.  The conditions get cleared a) when the rebalance succeeds and b) at the end of the reconcile function, so we can assume this happened.

          Based on the logs I can only hypothesize that Kubernetes/etcd is not persisting our changes.  Since you run workloads on your master nodes (bad), it's entirely possible you are causing a DoS situation, especially when cbworkloadgen is running on the same node (there could be a delay between the pod being deleted and kubelet doing something).

          I think in order to make more sense of this you need to collect logs with --system specified.

          I note that the last event was at "2018-09-25T18:00:31Z" but the logs were collected at "20180925T105943-0700", i.e. the collection started before the events happened.  Please ensure you run NTP on all your infrastructure.
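
          The two timestamps quoted above use different formats and offsets, so the ordering is easier to see once both are converted to UTC. A small sketch using only the Python standard library (the variable names are illustrative; the values are the ones quoted in this comment):

```python
from datetime import datetime, timezone

# Last event timestamp from the event log (already UTC).
last_event = datetime.strptime(
    "2018-09-25T18:00:31Z", "%Y-%m-%dT%H:%M:%SZ"
).replace(tzinfo=timezone.utc)

# Log collection timestamp, local time with a -0700 offset.
collected = datetime.strptime("20180925T105943-0700", "%Y%m%dT%H%M%S%z")
collected_utc = collected.astimezone(timezone.utc)

# Positive gap means the collection timestamp precedes the last event.
gap_seconds = (last_event - collected_utc).total_seconds()
print(collected_utc.isoformat(), gap_seconds)
```

          Converted to UTC, the collection timestamp is 17:59:43Z, which is 48 seconds before the last event.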

          Simon Murray I am not able to reproduce this issue currently.

          Will re-open once we reproduce the issue with logs.

          People

            ashwin.govindarajulu Ashwin Govindarajulu
            ashwin.govindarajulu Ashwin Govindarajulu
