Multi-node auto-failover: the Couchbase cluster failed over the node, but couchbase-operator didn't catch that event
Activity
Ashwin Govindarajulu August 27, 2018 at 7:23 AM
A similar issue has already been reported, so closing this ticket.
Mike Wiederhold August 13, 2018 at 6:30 PM
Network flakiness seems to have contributed to behavior that wasn't expected in a steady-state use case. Given the circumstances, the operator did properly bring the cluster back to a stable state, so I am going to close this as done; I don't think there are any issues that need to be fixed.
Mike Wiederhold August 9, 2018 at 6:13 AM
After investigating this further, it looks like the auto-failover didn't take place as quickly as expected, likely due to network flakiness. I did find some strange behavior with node 0003 moving between the down, active, and add-back states, but eventually the right behavior happened. I'm going to discuss this behavior with Dave Finlay tomorrow, but I don't think there are any testing or operator changes needed as a result of this issue.
Simon Murray August 7, 2018 at 2:52 PM
Odd; you'll need to check the ns_server logs to see why it thinks the node was in the failed-add state. Again, we only report what it tells us.
Ashwin Govindarajulu August 7, 2018 at 12:26 PM
But according to the operator, the cluster was balanced and healthy with all 9 nodes before the cluster pods were killed.
TestCase: TestMultiNodeAutoFailover
Scenario:
Spawn a 9-node cluster
Kill nodes 0002, 0003, and 0004 so that multi-node failover happens
The couchbase-cluster logs say node 0003 was failed over.
But the operator events didn't catch that event.
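The scenario above can be sketched with kubectl; this is a minimal, illustrative reproduction, assuming the cluster's pods follow the usual `<cluster-name>-XXXX` naming and that the cluster is named `cb-example` in the `default` namespace (both are placeholders, not names from this ticket):

```shell
# Kill three Couchbase pods at roughly the same time so that
# multi-node auto-failover is triggered. Pod names are assumed;
# substitute the actual pod names from your cluster.
for suffix in 0002 0003 0004; do
  kubectl delete pod "cb-example-${suffix}" --namespace default &
done
wait

# Then compare what the operator observed against the cluster logs:
# the operator's view of the failover should show up as events on the
# CouchbaseCluster resource.
kubectl get events --namespace default \
  --field-selector involvedObject.name=cb-example
```

If the couchbase-cluster logs show a failover for node 0003 but no corresponding entry appears in the events output, that matches the discrepancy reported in this ticket.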
Operator events: