Multi-node auto-failover: Couchbase cluster failed over the node, but couchbase-operator didn't catch that event
Activity
Ashwin Govindarajulu August 27, 2018 at 7:23 AM
https://couchbasecloud.atlassian.net/browse/K8S-537#icft=K8S-537 reports a similar issue, so I am closing this ticket.
Mike Wiederhold August 13, 2018 at 6:30 PM
Network flakiness seems to have contributed to behavior that wouldn't be expected in a steady-state use case. The operator did properly get the cluster back to a stable state given the circumstances, so I am going to close this as done since I don't think there are any issues that need to be fixed.
Mike Wiederhold August 9, 2018 at 6:13 AM
After investigating this further, it looks like the auto-failover didn't take place as quickly as expected, likely due to network flakiness. I did find some strange behavior with node 0003 moving between down, active, and add back states, but eventually the right behavior happened. I'm going to discuss this behavior with Dave Finlay tomorrow, but I don't think there are any testing or operator changes needed as a result of this issue.
Simon Murray August 7, 2018 at 2:52 PM
Odd, you'll need to check the NS server logs and see why it thinks the node was in the failed add state. Again we only report what it tells us.
Ashwin Govindarajulu August 7, 2018 at 12:26 PM
But according to the operator itself, the cluster was balanced and healthy with all 9 nodes before the cluster pods were killed.
- count: 1
  eventTime: null
  firstTimestamp: 2018-08-07T01:02:58Z
  involvedObject:
    apiVersion: couchbase.database.couchbase.com/v1
    kind: CouchbaseCluster
    name: test-couchbase-r5j6p
    namespace: default
    resourceVersion: "48381"
    uid: c3d66017-99dc-11e8-a5f8-bacf8e15f9bf
  lastTimestamp: 2018-08-07T01:02:58Z
  message: A rebalance has been started to balance data across the cluster
  metadata:
    creationTimestamp: 2018-08-07T01:02:58Z
    generateName: test-couchbase-r5j6p-
    name: test-couchbase-r5j6p-lw9z6
    namespace: default
    resourceVersion: "48407"
    selfLink: /api/v1/namespaces/default/events/test-couchbase-r5j6p-lw9z6
    uid: a00aff97-99dd-11e8-a5f8-bacf8e15f9bf
  reason: RebalanceStarted
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: couchbase-operator-f79c88c9b-bfm8z
  type: Normal
- count: 1
  eventTime: null
  firstTimestamp: 2018-08-07T01:03:14Z
  involvedObject:
    apiVersion: couchbase.database.couchbase.com/v1
    kind: CouchbaseCluster
    name: test-couchbase-r5j6p
    namespace: default
    resourceVersion: "48408"
    uid: c3d66017-99dc-11e8-a5f8-bacf8e15f9bf
  lastTimestamp: 2018-08-07T01:03:14Z
  message: A rebalance has completed
  metadata:
    creationTimestamp: 2018-08-07T01:03:14Z
    generateName: test-couchbase-r5j6p-
    name: test-couchbase-r5j6p-6vqkx
    namespace: default
    resourceVersion: "48445"
    selfLink: /api/v1/namespaces/default/events/test-couchbase-r5j6p-6vqkx
    uid: a9b490f1-99dd-11e8-a5f8-bacf8e15f9bf
  reason: RebalanceCompleted
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: couchbase-operator-f79c88c9b-bfm8z
  type: Normal
TestCase: TestMultiNodeAutoFailover
Scenario:
Spawn a 9-node cluster.
Kill nodes 0002, 0003 and 0004 so that a multi-node auto-failover happens.
The Couchbase cluster logs show that node 0003 was automatically failed over,
but the operator events did not record that failover.
#3604: [user:info,2018-08-07T01:04:45.306Z,ns_1@test-couchbase-r5j6p-0000.test-couchbase-r5j6p.default.svc:<0.709.0>:auto_failover:log_failover_success:561]Node ('ns_1@test-couchbase-r5j6p-0003.test-couchbase-r5j6p.default.svc') was automatically failed over. Reason: All monitors report node is unhealthy.
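When cross-checking the cluster log against the operator events, it can help to pull the failed-over node names out of the log mechanically. The following is a minimal sketch, not part of the operator or its test suite. The regular expression is derived only from the single log line quoted in this ticket, so other Couchbase Server versions may word the message differently, and the helper names `failed_over_hosts` and `pod_name` are made up for illustration.

```python
import re

# Assumption: the message wording matches the one log line quoted in this ticket.
FAILOVER_RE = re.compile(r"Node \('ns_1@([^']+)'\) was automatically failed over")


def failed_over_hosts(log_text):
    """Return the host names of all nodes reported as automatically failed over."""
    return FAILOVER_RE.findall(log_text)


def pod_name(host):
    """Reduce a fully qualified host name to its first label (the pod name)."""
    return host.split(".", 1)[0]


SAMPLE_LINE = (
    "[user:info,2018-08-07T01:04:45.306Z,"
    "ns_1@test-couchbase-r5j6p-0000.test-couchbase-r5j6p.default.svc:"
    "<0.709.0>:auto_failover:log_failover_success:561]"
    "Node ('ns_1@test-couchbase-r5j6p-0003.test-couchbase-r5j6p.default.svc') "
    "was automatically failed over. Reason: All monitors report node is unhealthy."
)

if __name__ == "__main__":
    for host in failed_over_hosts(SAMPLE_LINE):
        print(pod_name(host))
```

The resulting pod names can then be compared with the members named in the operator's MemberFailedOver events.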
Operator events:
util.go:189: Expected events to be:
Type: Normal | Reason: ServiceCreated | Message: Service for admin console `test-couchbase-r5j6p-ui` was created
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0000 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0001 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0002 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0003 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0004 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0005 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0006 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0007 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0008 added to cluster
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
Type: Normal | Reason: BucketCreated | Message: A new bucket `default` was created
Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0002 down
Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0003 down
Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0004 down
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0002 failed over
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0003 failed over
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0004 failed over
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0009 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0010 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0011 added to cluster
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0002 removed from the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0003 removed from the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0004 removed from the cluster
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
but got:
Type: Normal | Reason: ServiceCreated | Message: Service for admin console `test-couchbase-r5j6p-ui` was created
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0000 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0001 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0002 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0003 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0004 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0005 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0006 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0007 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0008 added to cluster
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
Type: Normal | Reason: BucketCreated | Message: A new bucket `default` was created
Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0002 down
Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0003 down
Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0004 down
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0002 failed over
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0004 failed over
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0009 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0010 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0011 added to cluster
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
Type: Normal | Reason: FailedAddNode | Message: Removed existing member test-couchbase-r5j6p-0003 because it failed before it could be added to the cluster
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0002 removed from the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0004 removed from the cluster
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
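The two event lists in the util.go output are long enough that the differences are hard to spot by eye. The sketch below shows one way to compute them; it is a triage aid written for this ticket, not the comparison the test framework itself performs. It assumes one `Type: ... | Reason: ... | Message: ...` record per line, the function names `parse_events` and `diff_events` are hypothetical, and the sample data is only the excerpt of both lists from the first MemberFailedOver event up to the MemberRemoved events.

```python
from collections import Counter


def parse_events(text):
    """Parse 'Type: X | Reason: Y | Message: Z' records, one per line."""
    events = []
    for line in text.strip().splitlines():
        # Split into at most three fields so a '|' inside the message survives.
        parts = line.strip().split(" | ", 2)
        fields = dict(part.split(": ", 1) for part in parts)
        events.append((fields["Type"], fields["Reason"], fields["Message"]))
    return events


def diff_events(expected, actual):
    """Return (missing, unexpected) events, counting duplicates but ignoring order."""
    exp, act = Counter(expected), Counter(actual)
    missing = list((exp - act).elements())
    unexpected = list((act - exp).elements())
    return missing, unexpected


# Excerpt of the expected and observed lists reported in this ticket.
EXPECTED = """
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0002 failed over
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0003 failed over
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0004 failed over
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0002 removed from the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0003 removed from the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0004 removed from the cluster
"""

ACTUAL = """
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0002 failed over
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0004 failed over
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
Type: Normal | Reason: FailedAddNode | Message: Removed existing member test-couchbase-r5j6p-0003 because it failed before it could be added to the cluster
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0002 removed from the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0004 removed from the cluster
"""

if __name__ == "__main__":
    missing, unexpected = diff_events(parse_events(EXPECTED), parse_events(ACTUAL))
    for event in missing:
        print("missing:   ", " | ".join(event))
    for event in unexpected:
        print("unexpected:", " | ".join(event))
```

On this excerpt the missing events are the MemberFailedOver and MemberRemoved events for node 0003, and the unexpected ones are the FailedAddNode event for node 0003 plus a second RebalanceStarted. Because the comparison ignores ordering, it would not flag events that merely arrive in a different sequence.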