  1. Couchbase Kubernetes
  2. K8S-516

Multi-node auto-failover: Couchbase cluster failed over the node, but couchbase-operator didn't catch that event


Details

    Description

      TestCase: TestMultiNodeAutoFailover

      Scenario:

      1. Spawn 9 node cluster
      2. Kill nodes 0002, 0003, 0004 so that a multi-node failover happens

      The Couchbase cluster logs say that node 0003 was failed over.

      But the operator events did not record that failover.

      #3604: [user:info,2018-08-07T01:04:45.306Z,ns_1@test-couchbase-r5j6p-0000.test-couchbase-r5j6p.default.svc:<0.709.0>:auto_failover:log_failover_success:561]Node ('ns_1@test-couchbase-r5j6p-0003.test-couchbase-r5j6p.default.svc') was automatically failed over. Reason: All monitors report node is unhealthy.
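
To cross-check a cluster log line like the one above against the operator events, the failed-over node name can be pulled out of the message. This is only a sketch: the regular expression is written against the single log line quoted in this ticket, and `FailedOverNode` is a hypothetical helper, not part of the operator or its test framework.

```go
package main

import (
	"fmt"
	"regexp"
)

// failedOverRe matches the auto-failover success message quoted in this
// ticket. The pattern is an assumption based on that one line only.
var failedOverRe = regexp.MustCompile(`Node \('([^']+)'\) was automatically failed over`)

// FailedOverNode (hypothetical helper) returns the node name from an
// auto-failover success log line, or false if the line does not match.
func FailedOverNode(line string) (string, bool) {
	m := failedOverRe.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	line := "#3604: [user:info,2018-08-07T01:04:45.306Z," +
		"ns_1@test-couchbase-r5j6p-0000.test-couchbase-r5j6p.default.svc:<0.709.0>:" +
		"auto_failover:log_failover_success:561]" +
		"Node ('ns_1@test-couchbase-r5j6p-0003.test-couchbase-r5j6p.default.svc') " +
		"was automatically failed over. Reason: All monitors report node is unhealthy."

	if node, ok := FailedOverNode(line); ok {
		fmt.Println(node)
	}
}
```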

      Operator events:

      util.go:189: Expected events to be:
          		Type: Normal | Reason: ServiceCreated | Message: Service for admin console `test-couchbase-r5j6p-ui` was created
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0000 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0001 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0002 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0003 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0004 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0005 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0006 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0007 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0008 added to cluster
          		Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
          		Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
          		Type: Normal | Reason: BucketCreated | Message: A new bucket `default` was created
          		Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0002 down
          		Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0003 down
          		Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0004 down
          		Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0002 failed over
          		Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0003 failed over
          		Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0004 failed over
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0009 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0010 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0011 added to cluster
          		Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
          		Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0002 removed from the cluster
          		Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0003 removed from the cluster
          		Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0004 removed from the cluster
          		Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
          		
          		but got:
          		Type: Normal | Reason: ServiceCreated | Message: Service for admin console `test-couchbase-r5j6p-ui` was created
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0000 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0001 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0002 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0003 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0004 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0005 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0006 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0007 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0008 added to cluster
          		Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
          		Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
          		Type: Normal | Reason: BucketCreated | Message: A new bucket `default` was created
          		Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0002 down
          		Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0003 down
          		Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0004 down
          		Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0002 failed over
          		Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0004 failed over
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0009 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0010 added to cluster
          		Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0011 added to cluster
          		Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
          		Type: Normal | Reason: FailedAddNode | Message: Removed existing member test-couchbase-r5j6p-0003 because it failed before it could be added to the cluster
          		Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
          		Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0002 removed from the cluster
          		Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0004 removed from the cluster
          		Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
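
The two lists above differ in only a few entries, which are easy to miss by eye. A minimal sketch of a multiset diff that isolates them follows. The `Event` type and `DiffEvents` function are hypothetical names for illustration and are not the operator test framework's API. Only the events after the `MemberDown` warnings are included, and the identical `NewMemberAdded` entries for 0009 to 0011 are left out.

```go
package main

import "fmt"

// Event is a hypothetical, simplified stand-in for an operator event.
type Event struct {
	Reason  string
	Message string
}

// subtract returns the events in a that have no matching event in b,
// treating both slices as multisets and preserving the order of a.
func subtract(a, b []Event) []Event {
	counts := make(map[Event]int)
	for _, e := range b {
		counts[e]++
	}
	var out []Event
	for _, e := range a {
		if counts[e] > 0 {
			counts[e]--
			continue
		}
		out = append(out, e)
	}
	return out
}

// DiffEvents reports events that were expected but not seen (missing)
// and events that were seen but not expected (unexpected).
func DiffEvents(expected, got []Event) (missing, unexpected []Event) {
	return subtract(expected, got), subtract(got, expected)
}

func main() {
	const p = "Existing member test-couchbase-r5j6p-"
	rebalance := Event{"RebalanceStarted", "A rebalance has been started to balance data across the cluster"}
	done := Event{"RebalanceCompleted", "A rebalance has completed"}

	expected := []Event{
		{"MemberFailedOver", p + "0002 failed over"},
		{"MemberFailedOver", p + "0003 failed over"},
		{"MemberFailedOver", p + "0004 failed over"},
		rebalance,
		{"MemberRemoved", p + "0002 removed from the cluster"},
		{"MemberRemoved", p + "0003 removed from the cluster"},
		{"MemberRemoved", p + "0004 removed from the cluster"},
		done,
	}
	got := []Event{
		{"MemberFailedOver", p + "0002 failed over"},
		{"MemberFailedOver", p + "0004 failed over"},
		rebalance,
		{"FailedAddNode", "Removed existing member test-couchbase-r5j6p-0003 because it failed before it could be added to the cluster"},
		rebalance,
		{"MemberRemoved", p + "0002 removed from the cluster"},
		{"MemberRemoved", p + "0004 removed from the cluster"},
		done,
	}

	missing, unexpected := DiffEvents(expected, got)
	for _, e := range missing {
		fmt.Println("missing:   ", e.Reason, "|", e.Message)
	}
	for _, e := range unexpected {
		fmt.Println("unexpected:", e.Reason, "|", e.Message)
	}
}
```

With the data from this ticket, the diff reduces to node 0003: its `MemberFailedOver` and `MemberRemoved` events are missing, while a `FailedAddNode` event and one extra `RebalanceStarted` appear instead.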


        Activity

          simon.murray Simon Murray added a comment -

          If you look closely you killed it before it was added to the cluster:

          Type: Normal | Reason: FailedAddNode | Message: Removed existing member test-couchbase-r5j6p-0003 because it failed before it could be added to the cluster

           


          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          But according to the operator's own events, the cluster was balanced and healthy with all 9 nodes before the cluster pods were killed.

          - count: 1
            eventTime: null
            firstTimestamp: 2018-08-07T01:02:58Z
            involvedObject:
              apiVersion: couchbase.database.couchbase.com/v1
              kind: CouchbaseCluster
              name: test-couchbase-r5j6p
              namespace: default
              resourceVersion: "48381"
              uid: c3d66017-99dc-11e8-a5f8-bacf8e15f9bf
            lastTimestamp: 2018-08-07T01:02:58Z
            message: A rebalance has been started to balance data across the cluster
            metadata:
              creationTimestamp: 2018-08-07T01:02:58Z
              generateName: test-couchbase-r5j6p-
              name: test-couchbase-r5j6p-lw9z6
              namespace: default
              resourceVersion: "48407"
              selfLink: /api/v1/namespaces/default/events/test-couchbase-r5j6p-lw9z6
              uid: a00aff97-99dd-11e8-a5f8-bacf8e15f9bf
            reason: RebalanceStarted
            reportingComponent: ""
            reportingInstance: ""
            source:
              component: couchbase-operator-f79c88c9b-bfm8z
            type: Normal
          - count: 1
            eventTime: null
            firstTimestamp: 2018-08-07T01:03:14Z
            involvedObject:
              apiVersion: couchbase.database.couchbase.com/v1
              kind: CouchbaseCluster
              name: test-couchbase-r5j6p
              namespace: default
              resourceVersion: "48408"
              uid: c3d66017-99dc-11e8-a5f8-bacf8e15f9bf
            lastTimestamp: 2018-08-07T01:03:14Z
            message: A rebalance has completed
            metadata:
              creationTimestamp: 2018-08-07T01:03:14Z
              generateName: test-couchbase-r5j6p-
              name: test-couchbase-r5j6p-6vqkx
              namespace: default
              resourceVersion: "48445"
              selfLink: /api/v1/namespaces/default/events/test-couchbase-r5j6p-6vqkx
              uid: a9b490f1-99dd-11e8-a5f8-bacf8e15f9bf
            reason: RebalanceCompleted
            reportingComponent: ""
            reportingInstance: ""
            source:
              component: couchbase-operator-f79c88c9b-bfm8z
            type: Normal

          simon.murray Simon Murray added a comment -

          Odd, you'll need to check the NS server logs and see why it thinks the node was in the failed add state.  Again we only report what it tells us.


          mikew Mike Wiederhold [X] (Inactive) added a comment -

          After investigating this further, it looks like the auto-failover didn't take place as quickly as expected, likely due to network flakiness. I did find some strange behavior with node 0003 moving between down, active, and add back states, but eventually the right behavior happened. I'm going to discuss this behavior with Dave Finlay tomorrow, but I don't think there are any testing or operator changes needed as a result of this issue.

          mikew Mike Wiederhold [X] (Inactive) added a comment -

          Network flakiness seems to have contributed to behavior that wasn't expected in a steady-state use case. The operator did properly handle getting the cluster back to a stable state given the circumstances, so I am going to close this as done since I don't think there are any issues that need to be fixed.

          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          K8S-537 reports a similar issue, so closing this ticket.

          People

            mikew Mike Wiederhold [X] (Inactive)
            ashwin.govindarajulu Ashwin Govindarajulu