Multi-node auto-failover: the Couchbase cluster failed over the node, but couchbase-operator didn't catch that event
Activity
Ashwin Govindarajulu August 27, 2018 at 7:23 AM
A similar issue has already been reported, so closing this ticket.
Mike Wiederhold August 13, 2018 at 6:30 PM
Network flakiness seems to have contributed to behavior that wasn't expected in a steady-state use case. Given the circumstances, the operator did properly bring the cluster back to a stable state, so I am going to close this as done; I don't think there are any issues that need to be fixed.
Mike Wiederhold August 9, 2018 at 6:13 AM
After investigating this further, it looks like the auto-failover didn't take place as quickly as expected, likely due to network flakiness. I did find some strange behavior with node 0003 moving between the down, active, and add-back states, but eventually the right behavior happened. I'm going to discuss this behavior with Dave Finlay tomorrow, but I don't think there are any testing or operator changes needed as a result of this issue.
Simon Murray August 7, 2018 at 2:52 PM
Odd; you'll need to check the ns_server logs to see why it thinks the node was in the failed-add state. Again, we only report what it tells us.
Ashwin Govindarajulu August 7, 2018 at 12:26 PM
But according to the operator, the cluster was balanced and healthy with all 9 nodes before the cluster pods were killed.
TestCase: TestMultiNodeAutoFailover
Scenario:
Spawn a 9-node cluster
Kill nodes 0002, 0003, and 0004 so that multi-node failover happens
The couchbase-cluster logs say node 0003 was failed over.
But the operator events didn't catch that event.
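The scenario above can be sketched with kubectl; this is a minimal, illustrative reproduction, assuming the cluster's pods follow the usual `<cluster-name>-XXXX` naming and that the cluster is named `cb-example` in the `default` namespace (both are placeholders, not names from this ticket):

```shell
# Kill three Couchbase pods at roughly the same time so that
# multi-node auto-failover is triggered. Pod names are assumed;
# substitute the actual pod names from your cluster.
for suffix in 0002 0003 0004; do
  kubectl delete pod "cb-example-${suffix}" --namespace default &
done
wait

# Then compare what the operator observed against the cluster logs:
# the operator's view of the failover should show up as events on the
# CouchbaseCluster resource.
kubectl get events --namespace default \
  --field-selector involvedObject.name=cb-example
```

If the couchbase-cluster logs show a failover for node 0003 but no corresponding entry appears in the events output, that matches the discrepancy reported in this ticket.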
Operator events: