Details
- Bug
- Resolution: Duplicate
- Critical
- None
- 6.5.0
- None
- Triaged
- No
- KV-Engine MH 2nd Beta
Description
When running the Jepsen failover tests, I have seen the rebalance get stuck when doing a full recovery after a hard failover of a node.
Steps to reproduce:
Either run the following Jepsen tests (you might need to run them a few times, since the rebalance gets stuck only intermittently) or follow the manual steps provided.
Tests to run:
- Clone couchbase.jepsen and set up the required nodes.
- Run any of the following tests
- lein trampoline run test --nodes-file ./nodes --username root --password couchbase --workload=failover --node-count=6 --no-autofailover --replicas=1 --failover-type=hard --recovery-type=full --disrupt-count=1 --kv-timeout=1.5 --durability=0:0:100:0
- lein trampoline run test --nodes-file ./nodes --username root --password couchbase --workload=failover --node-count=6 --no-autofailover --replicas=3 --failover-type=hard --recovery-type=full --disrupt-count=2 --kv-timeout=30 --durability=0:100:0:0
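Because the hang is intermittent, it can help to script the repeated runs. The following is a minimal sketch, not part of couchbase.jepsen: `build_command` and `run_until_failure` are hypothetical helper names, and it assumes that a failed run exits with a non-zero status, which may not hold if the test itself hangs.

```python
import subprocess


def build_command(replicas, disrupt_count, kv_timeout, durability):
    """Build the lein argument list using only the flags shown in this ticket."""
    return [
        "lein", "trampoline", "run", "test",
        "--nodes-file", "./nodes",
        "--username", "root",
        "--password", "couchbase",
        "--workload=failover",
        "--node-count=6",
        "--no-autofailover",
        f"--replicas={replicas}",
        "--failover-type=hard",
        "--recovery-type=full",
        f"--disrupt-count={disrupt_count}",
        f"--kv-timeout={kv_timeout}",
        f"--durability={durability}",
    ]


def run_until_failure(cmd, max_attempts=5, runner=subprocess.run):
    """Run cmd up to max_attempts times.

    Returns the 1-based attempt number of the first run that exits non-zero
    (assumption: non-zero means the test failed), or None if every run passed.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd)
        if result.returncode != 0:
            return attempt
    return None


if __name__ == "__main__":
    # First configuration from the ticket: 1 replica, 1 disruption.
    cmd = build_command(1, 1, "1.5", "0:0:100:0")
    print(run_until_failure(cmd))
```

The second configuration would be `build_command(3, 2, "30", "0:100:0:0")`. The `max_attempts` default of 5 is arbitrary.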
Or:
- Create a Mad Hatter cluster with a few nodes
- Start a load with durability set to either majority, persist to majority, or persist to all
- Introduce a failure in one of the nodes and hard failover that node
- Wait for some time, then remove the failure from the failed node
- Do a full recovery of the node and rebalance
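The failover, recovery, and rebalance steps above can also be driven through the cluster manager REST API instead of the UI. This is a rough sketch under stated assumptions: it uses the `/controller/failOver`, `/controller/setRecoveryType`, and `/controller/rebalance` endpoints on the admin port 8091, and it assumes node names of the form `ns_1@<host>` (the real names should be read from `/pools/default`). The credentials are placeholders, and starting the load and injecting or removing the failure are not covered.

```python
import base64
from urllib import parse, request


def otp_node(host):
    # Assumption: default node naming; check /pools/default for the real names.
    return f"ns_1@{host}"


def recovery_plan(failed_host, all_hosts):
    """Return the ordered (path, form) pairs: hard failover, full recovery, rebalance."""
    return [
        ("/controller/failOver", {"otpNode": otp_node(failed_host)}),
        ("/controller/setRecoveryType",
         {"otpNode": otp_node(failed_host), "recoveryType": "full"}),
        ("/controller/rebalance",
         {"knownNodes": ",".join(otp_node(h) for h in all_hosts),
          "ejectedNodes": ""}),
    ]


def post(base_url, user, password, path, form):
    """Send one form-encoded POST with HTTP basic auth and return the status code."""
    req = request.Request(base_url + path,
                          data=parse.urlencode(form).encode(),
                          method="POST")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with request.urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    # Hosts taken from the attached logs; credentials are hypothetical.
    hosts = ["172.23.105.197", "172.23.105.41"]
    for path, form in recovery_plan("172.23.105.197", hosts):
        # In a real run, wait and remove the injected failure between
        # the failover call and the setRecoveryType call.
        print(path, post("http://172.23.105.41:8091", "Administrator", "password", path, form))
```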
Expected: Full recovery to be completed successfully
Actual: The rebalance during recovery gets stuck.
The tests with graceful failover are passing as of now, as are the tests with hard failover and delta recovery. Only with hard failover and full recovery do I see the rebalance getting stuck.
The issue is also intermittent.
Tested on: 6.5.0-3883
In the attached logs for one of the tests, two nodes (172.23.105.197, 172.23.105.41) are failed over and then recovered.
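One way to tell a stuck rebalance from a slow one is to sample the progress periodically and flag the run when the value stops changing. The sketch below is only an illustration: `is_stuck` is a hypothetical helper, the window of 5 samples is arbitrary, and the samples are assumed to be fractions between 0 and 1 collected by the caller (for example from `/pools/default/rebalanceProgress`).

```python
def is_stuck(samples, window=5):
    """Return True if the last `window` progress samples are identical and below 1.0.

    samples: progress fractions in [0, 1], oldest first.
    """
    if len(samples) < window:
        return False  # not enough data to decide yet
    recent = samples[-window:]
    # Finished rebalances sit at 1.0 and must not be reported as stuck.
    return recent[-1] < 1.0 and min(recent) == max(recent)
```

A polling loop would append a new sample at a fixed interval and stop the test, collecting logs, once `is_stuck` returns True.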
Attachments
Issue Links
- duplicates MB-35633 Aborts received by replica should not be ignored when matching prepare is missing (Closed)