Couchbase Server / MB-25088

Delta recovery is interrupted: stopping rebalance as we received a "try_autofailover" request

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: 5.0.0
    • Affects Version: 5.0.0
    • Component: ns_server

    Description

      Affects builds 5.0.0-3192 and later.

      Setup:

      • 4 data nodes
      • 1 Couchbase bucket, 100M documents
      • Mixed workload (10K ops/sec)
      • Auto-failover timeout is 5 seconds

      Steps:

      • Load data
      • Start workload
      • Trigger graceful or hard failover after a few minutes (1 node - 172.23.99.206)
      • Add the node back
      • Trigger delta recovery (rebalance) after a few minutes
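
      The cluster-management part of the steps above can be sketched as a sequence of REST calls. This is only an illustration, not the test harness used for this report: the endpoint paths and form fields are written from memory of the Couchbase Server REST API and should be checked against the documentation for the build under test. The helper name `repro_requests` is hypothetical, "add the node back" is assumed to map to setting the recovery type, and data loading and the workload are left out. The sketch only builds the requests; it does not send them.

```python
from urllib.parse import urlencode

# Node names are taken from the logs in this report.
FAILED_NODE = "ns_1@172.23.99.206"
ALL_NODES = ["ns_1@172.23.99.%d" % i for i in (203, 204, 205, 206)]


def repro_requests(graceful=True):
    """Return ordered (path, form body) pairs for the scenario.

    Hypothetical helper: nothing is sent, the caller would POST each
    body to the cluster manager with admin credentials.
    """
    # Assumed endpoints for graceful and hard failover.
    failover_path = (
        "/controller/startGracefulFailover" if graceful else "/controller/failOver"
    )
    steps = [
        # Auto-failover timeout of 5 seconds, as in the setup.
        ("/settings/autoFailover", {"enabled": "true", "timeout": "5"}),
        # Fail over one node.
        (failover_path, {"otpNode": FAILED_NODE}),
        # Add the node back, marked for delta recovery (assumed mapping).
        ("/controller/setRecoveryType",
         {"otpNode": FAILED_NODE, "recoveryType": "delta"}),
        # Recovery is carried out by a rebalance that keeps all four nodes.
        ("/controller/rebalance",
         {"knownNodes": ",".join(ALL_NODES), "ejectedNodes": ""}),
    ]
    return [(path, urlencode(body)) for path, body in steps]


if __name__ == "__main__":
    for path, body in repro_requests():
        print("POST", path, body)
```

      Passing `graceful=False` switches the second request to the hard failover endpoint; the report states that both variants lead to the same outcome.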

      This scenario worked fine until this patch:

      https://github.com/couchbase/ns_server/commit/683a6944de4fb86170dfa06973ab23d3fa493dad

      This is what we observe in our tests now:

      [user:info,2017-06-24T15:31:02.497-07:00,ns_1@172.23.99.203:<0.32134.2>:ns_rebalancer:run_graceful_failover:1323]Starting vbucket moves for graceful failover of 'ns_1@172.23.99.206'
      [user:info,2017-06-24T15:31:10.696-07:00,ns_1@172.23.99.203:<0.32134.2>:ns_rebalancer:orchestrate_failover:79]Starting failing over 'ns_1@172.23.99.206'
      [user:info,2017-06-24T15:31:10.744-07:00,ns_1@172.23.99.203:<0.32134.2>:ns_rebalancer:orchestrate_failover:82]Failed over 'ns_1@172.23.99.206': ok
      

      [user:info,2017-06-24T15:51:20.208-07:00,ns_1@172.23.99.203:<0.1274.0>:ns_orchestrator:idle:666]Starting rebalance, KeepNodes = ['ns_1@172.23.99.203','ns_1@172.23.99.204',
                                       'ns_1@172.23.99.205','ns_1@172.23.99.206'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.99.206'],  Delta recovery buckets = all
      

      [user:info,2017-06-24T15:51:25.083-07:00,ns_1@172.23.99.203:<0.1274.0>:ns_orchestrator:rebalancing:858]Stopping rebalance as we received a {try_autofailover,'ns_1@172.23.99.206'} request
      

      [user:info,2017-06-24T15:52:20.228-07:00,ns_1@172.23.99.203:<0.7517.5>:ns_rebalancer:orchestrate_failover:79]Starting failing over 'ns_1@172.23.99.206'
      [user:info,2017-06-24T15:52:20.299-07:00,ns_1@172.23.99.203:<0.7517.5>:ns_rebalancer:orchestrate_failover:82]Failed over 'ns_1@172.23.99.206': ok
      

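      In other words, the node that we are trying to recover is being auto-failed over. The rebalance is stopped about 5 seconds after it starts (15:51:20 to 15:51:25), which matches the auto-failover timeout, and the node is failed over again at 15:52:20. Auto-failing over a node that is in the middle of delta recovery doesn't make sense.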
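
      The timing can be checked directly from the log timestamps. The snippet below is a small sketch that assumes the `[category:level,timestamp,node:...]` prefix seen in the excerpts above; the regular expression and the helper name `log_time` are illustrative and not part of ns_server. The first log message is shortened here, since only its timestamp is used.

```python
import re
from datetime import datetime

# Two lines from the excerpts above (first message shortened).
LOG_LINES = [
    "[user:info,2017-06-24T15:51:20.208-07:00,ns_1@172.23.99.203:<0.1274.0>:"
    "ns_orchestrator:idle:666]Starting rebalance",
    "[user:info,2017-06-24T15:51:25.083-07:00,ns_1@172.23.99.203:<0.1274.0>:"
    "ns_orchestrator:rebalancing:858]Stopping rebalance as we received a "
    "{try_autofailover,'ns_1@172.23.99.206'} request",
]

# Assumed prefix layout: [category:level,ISO-8601 timestamp,...
TIMESTAMP_RE = re.compile(r"^\[\w+:\w+,([^,]+),")


def log_time(line):
    """Extract the timestamp from one log line (hypothetical helper)."""
    match = TIMESTAMP_RE.match(line)
    if match is None:
        raise ValueError("unrecognized log line: %r" % line)
    # Requires Python 3.7+ for datetime.fromisoformat.
    return datetime.fromisoformat(match.group(1))


AUTO_FAILOVER_TIMEOUT_S = 5  # from the setup section

gap_seconds = (log_time(LOG_LINES[1]) - log_time(LOG_LINES[0])).total_seconds()
print("rebalance ran for %.3f s (auto-failover timeout: %d s)"
      % (gap_seconds, AUTO_FAILOVER_TIMEOUT_S))
```

      The gap between the two lines is 4.875 seconds, which is consistent with the auto-failover timer firing against the node under recovery almost as soon as the rebalance begins.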

            People

              pavelpaulau Pavel Paulau (Inactive)