Multi-node auto-failover: Couchbase cluster failed over the node, but couchbase-operator didn't catch that event

Description

TestCase: TestMultiNodeAutoFailover

Scenario:

  1. Spawn a 9-node cluster

  2. Kill nodes 0002, 0003, and 0004 to trigger a multi-node failover

The couchbase-cluster logs show that node 0003 was failed over.

But the operator events did not record that failover.

#3604: [user:info,2018-08-07T01:04:45.306Z,ns_1@test-couchbase-r5j6p-0000.test-couchbase-r5j6p.default.svc:<0.709.0>:auto_failover:log_failover_success:561]Node ('ns_1@test-couchbase-r5j6p-0003.test-couchbase-r5j6p.default.svc') was automatically failed over. Reason: All monitors report node is unhealthy.
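When scanning the cluster logs for entries like the one above, a small parser can pull out which member was auto-failed over and when. This is only a sketch: the regular expression is inferred from that single log line, and the helper names `parse_failover` and `member_name` are hypothetical, not part of any Couchbase tooling.

```python
import re

# Pattern inferred from the single auto_failover entry quoted in this ticket;
# the exact field layout of other ns_server log lines is an assumption.
FAILOVER_RE = re.compile(
    r"\[user:info,(?P<timestamp>[^,]+),.*?auto_failover:log_failover_success:\d+\]"
    r"Node \('(?P<node>[^']+)'\) was automatically failed over\. "
    r"Reason: (?P<reason>.+)$"
)


def parse_failover(line):
    """Return (timestamp, node, reason) for a failover entry, else None."""
    match = FAILOVER_RE.search(line.strip())
    if match is None:
        return None
    return match.group("timestamp"), match.group("node"), match.group("reason")


def member_name(node):
    """Reduce 'ns_1@<pod>.<service>.<namespace>.svc' to the pod name."""
    return node.split("@", 1)[1].split(".", 1)[0]


LOG_LINE = (
    "#3604: [user:info,2018-08-07T01:04:45.306Z,"
    "ns_1@test-couchbase-r5j6p-0000.test-couchbase-r5j6p.default.svc:<0.709.0>:"
    "auto_failover:log_failover_success:561]"
    "Node ('ns_1@test-couchbase-r5j6p-0003.test-couchbase-r5j6p.default.svc') "
    "was automatically failed over. Reason: All monitors report node is unhealthy."
)

timestamp, node, reason = parse_failover(LOG_LINE)
print(timestamp, member_name(node), reason)
```

Running the parser over the full log would give a list of failed-over members to compare against the operator's MemberFailedOver events.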

Operator events:

util.go:189: Expected events to be:

Type: Normal | Reason: ServiceCreated | Message: Service for admin console `test-couchbase-r5j6p-ui` was created
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0000 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0001 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0002 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0003 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0004 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0005 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0006 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0007 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0008 added to cluster
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
Type: Normal | Reason: BucketCreated | Message: A new bucket `default` was created
Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0002 down
Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0003 down
Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0004 down
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0002 failed over
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0003 failed over
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0004 failed over
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0009 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0010 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0011 added to cluster
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0002 removed from the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0003 removed from the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0004 removed from the cluster
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed

but got:

Type: Normal | Reason: ServiceCreated | Message: Service for admin console `test-couchbase-r5j6p-ui` was created
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0000 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0001 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0002 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0003 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0004 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0005 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0006 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0007 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0008 added to cluster
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
Type: Normal | Reason: BucketCreated | Message: A new bucket `default` was created
Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0002 down
Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0003 down
Type: Warning | Reason: MemberDown | Message: Existing member test-couchbase-r5j6p-0004 down
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0002 failed over
Type: Warning | Reason: MemberFailedOver | Message: Existing member test-couchbase-r5j6p-0004 failed over
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0009 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0010 added to cluster
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-r5j6p-0011 added to cluster
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
Type: Normal | Reason: FailedAddNode | Message: Removed existing member test-couchbase-r5j6p-0003 because it failed before it could be added to the cluster
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0002 removed from the cluster
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-r5j6p-0004 removed from the cluster
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
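To make the mismatch easier to see, the two event lists can be diffed as multisets. The sketch below is not how the test framework's util.go performs its check; it is an illustrative, order-insensitive comparison in which `diff_events` and the `(reason, member)` tuple encoding are hypothetical. Only the post-failure tail of each list is encoded, since the events up to BucketCreated are identical on both sides.

```python
from collections import Counter


def diff_events(expected, actual):
    """Return (missing, unexpected) events, ignoring order but not counts."""
    exp, act = Counter(expected), Counter(actual)
    missing = sorted((exp - act).elements())
    unexpected = sorted((act - exp).elements())
    return missing, unexpected


# Events after BucketCreated, encoded as (reason, member suffix or "").
EXPECTED = (
    [("MemberDown", m) for m in ("0002", "0003", "0004")]
    + [("MemberFailedOver", m) for m in ("0002", "0003", "0004")]
    + [("NewMemberAdded", m) for m in ("0009", "0010", "0011")]
    + [("RebalanceStarted", "")]
    + [("MemberRemoved", m) for m in ("0002", "0003", "0004")]
    + [("RebalanceCompleted", "")]
)

ACTUAL = (
    [("MemberDown", m) for m in ("0002", "0003", "0004")]
    + [("MemberFailedOver", m) for m in ("0002", "0004")]
    + [("NewMemberAdded", m) for m in ("0009", "0010", "0011")]
    + [("RebalanceStarted", ""), ("FailedAddNode", "0003"), ("RebalanceStarted", "")]
    + [("MemberRemoved", m) for m in ("0002", "0004")]
    + [("RebalanceCompleted", "")]
)

missing, unexpected = diff_events(EXPECTED, ACTUAL)
print("missing:", missing)
print("unexpected:", unexpected)
```

Every difference involves node 0003: its MemberFailedOver and MemberRemoved events are missing, replaced by a FailedAddNode event and an additional RebalanceStarted.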

Environment

None

Release Notes Description

None

Attachments

2
  • 09 Aug 2018, 06:14 AM
  • 07 Aug 2018, 11:16 AM

Activity

Ashwin Govindarajulu August 27, 2018 at 7:23 AM

https://couchbasecloud.atlassian.net/browse/K8S-537#icft=K8S-537 reports a similar issue, so I am closing this ticket.

Mike Wiederhold August 13, 2018 at 6:30 PM

Network flakiness seems to have contributed to behavior that wouldn't be expected in a steady-state use case. The operator did properly get the cluster back to a stable state given the circumstances, so I am going to close this as done since I don't think there are any issues that need to be fixed.

Mike Wiederhold August 9, 2018 at 6:13 AM

After investigating this further, it looks like the auto-failover didn't take place as quickly as expected, likely due to network flakiness. I did find some strange behavior with node 0003 moving between down, active, and add back states, but eventually the right behavior happened. I'm going to discuss this behavior with Dave Finlay tomorrow, but I don't think there are any testing or operator changes needed as a result of this issue.

Simon Murray August 7, 2018 at 2:52 PM

Odd. You'll need to check the NS server logs and see why it thinks the node was in the failed add state. Again, we only report what it tells us.

Ashwin Govindarajulu August 7, 2018 at 12:26 PM

But according to the operator's own events, the cluster was balanced and healthy with all 9 nodes before the cluster pods were killed.

 

- count: 1
  eventTime: null
  firstTimestamp: 2018-08-07T01:02:58Z
  involvedObject:
    apiVersion: couchbase.database.couchbase.com/v1
    kind: CouchbaseCluster
    name: test-couchbase-r5j6p
    namespace: default
    resourceVersion: "48381"
    uid: c3d66017-99dc-11e8-a5f8-bacf8e15f9bf
  lastTimestamp: 2018-08-07T01:02:58Z
  message: A rebalance has been started to balance data across the cluster
  metadata:
    creationTimestamp: 2018-08-07T01:02:58Z
    generateName: test-couchbase-r5j6p-
    name: test-couchbase-r5j6p-lw9z6
    namespace: default
    resourceVersion: "48407"
    selfLink: /api/v1/namespaces/default/events/test-couchbase-r5j6p-lw9z6
    uid: a00aff97-99dd-11e8-a5f8-bacf8e15f9bf
  reason: RebalanceStarted
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: couchbase-operator-f79c88c9b-bfm8z
  type: Normal
- count: 1
  eventTime: null
  firstTimestamp: 2018-08-07T01:03:14Z
  involvedObject:
    apiVersion: couchbase.database.couchbase.com/v1
    kind: CouchbaseCluster
    name: test-couchbase-r5j6p
    namespace: default
    resourceVersion: "48408"
    uid: c3d66017-99dc-11e8-a5f8-bacf8e15f9bf
  lastTimestamp: 2018-08-07T01:03:14Z
  message: A rebalance has completed
  metadata:
    creationTimestamp: 2018-08-07T01:03:14Z
    generateName: test-couchbase-r5j6p-
    name: test-couchbase-r5j6p-6vqkx
    namespace: default
    resourceVersion: "48445"
    selfLink: /api/v1/namespaces/default/events/test-couchbase-r5j6p-6vqkx
    uid: a9b490f1-99dd-11e8-a5f8-bacf8e15f9bf
  reason: RebalanceCompleted
  reportingComponent: ""
  reportingInstance: ""
  source:
    component: couchbase-operator-f79c88c9b-bfm8z
  type: Normal
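As a rough cross-check of the ordering, the RebalanceCompleted event timestamp can be compared with the auto-failover log entry for node 0003 from the description. This is only a sketch using the two timestamps quoted in this ticket; `parse_utc` is a hypothetical helper, and the calculation says nothing about when the pods were actually killed.

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse an ISO-8601 'Z' timestamp with or without fractional seconds."""
    for fmt in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            return datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognised timestamp: {stamp!r}")


# lastTimestamp of the RebalanceCompleted event above.
rebalance_completed = parse_utc("2018-08-07T01:03:14Z")
# Timestamp of the auto_failover log entry for node 0003.
failover_logged = parse_utc("2018-08-07T01:04:45.306Z")

gap_seconds = (failover_logged - rebalance_completed).total_seconds()
print(f"failover logged {gap_seconds:.3f}s after the rebalance completed")
```

The failover entry comes roughly a minute and a half after the initial rebalance finished, which is consistent with the cluster having been balanced before the failure sequence began.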

 

Done

Details

Assignee

Mike Wiederhold

Created August 7, 2018 at 11:16 AM
Updated August 27, 2018 at 7:23 AM
Resolved August 13, 2018 at 6:30 PM