Couchbase Server / MB-35335

[jepsen][Durability] Rebalance is stuck when doing full recovery after hard failover with durability parameters set.


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: 6.5.0
    • Component/s: None

    Description

When running the Jepsen failover tests, I have seen the rebalance get stuck during full recovery after a hard failover of a node.

      Steps to repro:

Either run the following Jepsen tests (you might need to run them a few times, since the rebalance gets stuck only intermittently) or follow the manual steps provided below.

      Tests to run:

      1. Clone couchbase.jepsen and set up the required nodes.
      2. Run any of the following tests
        1. lein trampoline run test --nodes-file ./nodes --username root --password couchbase --workload=failover --node-count=6 --no-autofailover --replicas=1 --failover-type=hard --recovery-type=full --disrupt-count=1 --kv-timeout=1.5 --durability=0:0:100:0
  2. lein trampoline run test --nodes-file ./nodes --username root --password couchbase --workload=failover --node-count=6 --no-autofailover --replicas=3 --failover-type=hard --recovery-type=full --disrupt-count=2 --kv-timeout=30 --durability=0:100:0:0

      Or:

1. Create a mad-hatter cluster with a few nodes
2. Start a load with the durability level set to either majority, persist to majority, or persist to all
3. Introduce a failure in one of the nodes and hard failover that node
4. Wait for some time, then remove the failure from the failed node
5. Do a full recovery of the node and rebalance
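The failover/recovery portion of the steps above can be sketched with couchbase-cli. This is a dry-run sketch, not the exact commands used in this test: the cluster address, failed-node address, and credentials are placeholders (not taken from this report), and the `run` wrapper only echoes each command; drop the wrapper to execute against a live cluster.

```shell
# Dry-run sketch of the manual repro (hard failover -> full recovery -> rebalance).
# All addresses and credentials below are placeholders.
CLUSTER="http://172.23.105.1:8091"   # any node still in the cluster
FAILED="172.23.105.197"              # the node to fail over and recover
AUTH="-u Administrator -p password"

run() { echo "+ $*"; }   # replace 'echo' with real execution on a live cluster

# Hard failover the faulty node
run couchbase-cli failover -c "$CLUSTER" $AUTH \
    --server-failover "$FAILED:8091" --hard

# After clearing the fault, mark the node for full recovery
run couchbase-cli recovery -c "$CLUSTER" $AUTH \
    --server-recovery "$FAILED:8091" --recovery-type full

# Rebalance the node back in -- the step that intermittently hangs
run couchbase-cli rebalance -c "$CLUSTER" $AUTH
```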

Expected: Full recovery completes successfully.

Actual: The rebalance during recovery is stuck.

The tests with graceful failover are currently passing, as are those with hard failover and delta recovery. It is only with hard failover and full recovery that I am seeing the rebalance get stuck, and even then the issue is intermittent.

Tested on: 6.5.0-3883

In the attached logs for one of the tests, two nodes (172.23.105.197 and 172.23.105.41) are failed over and then recovered.


People

    Assignee: drigby Dave Rigby (Inactive)
    Reporter: bharath.gp Bharath G P
