Couchbase Server / MB-31366

delta recovery and autofailover do not play nicely together



    Description

      Setup steps:

      • Build a 3-node cluster running Couchbase Server 5.5.1.
      • Enable auto-failover with a 10-second timeout for up to 1 event.
      • Add two buckets with 1 replica each. They can be empty.

      The goal is to set up a cluster where a rebalance takes longer than the auto-failover timeout. In our case we used physical machines with HDD storage. Rebalances took about 40 seconds: fast enough for a person to iterate over many of them, but longer than the 10-second auto-failover timeout.

      If you try to reproduce this in a VM environment or with very fast storage, you may need to increase the bucket count or take other measures so that the rebalance takes longer than the auto-failover timeout.
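
For anyone scripting the setup, here is a minimal sketch of the auto-failover and bucket steps as REST requests. It assumes the standard Couchbase Server REST endpoints on the cluster manager port (`/settings/autoFailover` and `/pools/default/buckets`); the bucket names and RAM quota are hypothetical placeholders, and building the 3-node cluster itself is not shown. The sketch only constructs the requests, it does not send them.

```python
from urllib.parse import urlencode


def setup_requests(bucket_names, ram_quota_mb=100):
    """Return (method, path, form_body) tuples for the setup steps.

    Assumption: endpoint names and form fields follow the Couchbase Server
    REST API; bucket names and the RAM quota are illustrative only.
    """
    requests = [
        # Auto-failover: 10-second timeout, at most 1 event.
        ("POST", "/settings/autoFailover",
         urlencode({"enabled": "true", "timeout": 10, "maxCount": 1})),
    ]
    for name in bucket_names:
        # One replica per bucket; the buckets can stay empty.
        requests.append(
            ("POST", "/pools/default/buckets",
             urlencode({"name": name, "bucketType": "couchbase",
                        "ramQuotaMB": ram_quota_mb, "replicaNumber": 1})))
    return requests


for method, path, body in setup_requests(["bucket1", "bucket2"]):
    print(method, path, body)
```

Each tuple can be sent with any HTTP client using admin credentials against one node of the cluster.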

       

      With the above setup, here is how to reproduce:

      1. Pick a node and do a graceful failover.
      2. Mark that node for add-back with delta recovery.
      3. Rebalance.
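
The three steps above can also be driven through the REST API. The following is a sketch under the assumption that the usual cluster manager endpoints are used (`/controller/startGracefulFailover`, `/controller/setRecoveryType`, `/controller/rebalance`); the node names are hypothetical. It only builds the requests. A real script would also need to wait for the graceful failover to finish before issuing the second request.

```python
from urllib.parse import urlencode


def repro_requests(target, all_nodes):
    """Return (method, path, form_body) tuples for the three repro steps.

    `target` and `all_nodes` are otpNode names such as "ns_1@host"
    (hypothetical values in the example below).
    """
    known = ",".join(all_nodes)
    return [
        # 1. Graceful failover of the chosen node.
        ("POST", "/controller/startGracefulFailover",
         urlencode({"otpNode": target})),
        # 2. Mark the node for add-back with delta recovery
        #    (only after step 1 has completed).
        ("POST", "/controller/setRecoveryType",
         urlencode({"otpNode": target, "recoveryType": "delta"})),
        # 3. Rebalance with every node kept in the cluster.
        ("POST", "/controller/rebalance",
         urlencode({"knownNodes": known, "ejectedNodes": ""})),
    ]


nodes = ["ns_1@node1.example.com", "ns_1@node2.example.com",
         "ns_1@node3.example.com"]
for method, path, body in repro_requests(nodes[0], nodes):
    print(method, path, body)
```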

      Expected results:

      • The node rebalances back into the cluster cleanly, with no warnings, side effects, or other unexpected behavior.

      Actual results:

      • The message "Could not automatically fail over nodes (['ns_1@lca1-app0541.stg.linkedin.com']). Rebalance is running" appears on the Logs page of the web UI 10 seconds into the delta-recovery rebalance.
      • If the rebalance completes successfully, no further symptoms appear.

      However, a second test produces a worse result. Follow the same repro steps (graceful failover, then add-back with delta recovery), but this time interrupt the rebalance partway through, after at least 10 seconds have elapsed. We would expect the node to remain a member of the cluster, with the cluster left in an unbalanced state. Instead, the node is immediately auto-failed-over out of the cluster again.
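
The timing of this second test can be sketched as follows. The helper is hypothetical: `send` stands in for whatever HTTP client is used, the 2-second margin is an arbitrary choice, and the `/controller/rebalance` and `/controller/stopRebalance` paths are assumed to be the standard REST endpoints. The clock and sleep functions are injectable so the example below runs instantly against a fake clock instead of a real cluster.

```python
import time


def interrupt_after_timeout(send, rebalance_body, autofailover_timeout_s,
                            margin_s=2.0, clock=time.monotonic,
                            sleep=time.sleep):
    """Start a rebalance, then stop it once the auto-failover timeout
    (plus a small margin) has elapsed. Returns the seconds waited."""
    started = clock()
    send("POST", "/controller/rebalance", rebalance_body)
    deadline = started + autofailover_timeout_s + margin_s
    while clock() < deadline:
        sleep(0.5)
    send("POST", "/controller/stopRebalance", "")
    return clock() - started


# Dry run with a fake clock and a fake sender (no cluster needed).
fake_now = [0.0]
calls = []


def fake_send(method, path, body):
    calls.append((path, fake_now[0]))


def fake_sleep(seconds):
    fake_now[0] += seconds


waited = interrupt_after_timeout(
    fake_send, "knownNodes=ns_1%40node1.example.com", 10,
    clock=lambda: fake_now[0], sleep=fake_sleep)
print(calls, waited)
```

After the stop request, the membership of the recovered node can be checked to see whether it was auto-failed-over as described above.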

      This seems like clearly incorrect behavior.

       

      I have found some ways to fail a node out and bring it back into the cluster without triggering the above condition. Any one of them works:

      • Use full recovery instead of delta recovery.
      • Disable auto-failover and then do the delta recovery.
      • Set the auto-failover timeout very high (higher than the time a rebalance takes to complete) and then do the delta recovery.
      • Leave auto-failover enabled with a low timeout (10 seconds) and start a delta recovery, but abort it after just a couple of seconds, before the auto-failover timeout is reached. A subsequent rebalance then exhibits no strange behavior.
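
As an illustration of the second and third workarounds, here is a sketch of how automation might pause auto-failover around a delta recovery, or check that the timeout exceeds the expected rebalance time. Both helpers are hypothetical, `send` is a placeholder for an HTTP client, and the `/settings/autoFailover` endpoint and its form fields are assumed to match the REST API. Note that the cluster has no auto-failover protection while it is paused.

```python
from contextlib import contextmanager
from urllib.parse import urlencode


@contextmanager
def autofailover_paused(send, timeout_s=10, max_count=1):
    """Disable auto-failover for the duration of the block, then restore
    it with the given settings, even if the block raises."""
    send("POST", "/settings/autoFailover", urlencode({"enabled": "false"}))
    try:
        yield
    finally:
        send("POST", "/settings/autoFailover",
             urlencode({"enabled": "true", "timeout": timeout_s,
                        "maxCount": max_count}))


def timeout_covers_rebalance(timeout_s, expected_rebalance_s):
    """True if the auto-failover timeout is longer than the rebalance."""
    return timeout_s > expected_rebalance_s


# Dry run: record the requests instead of sending them.
log = []
with autofailover_paused(lambda *request: log.append(request)):
    log.append(("delta recovery and rebalance would run here",))
print(log)
print(timeout_covers_rebalance(10, 40))
```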

      We are about to upgrade a large number of our production Couchbase clusters using graceful failover and delta recovery, and we would rather see this behavior fixed than work around it in our automation and tooling.


          People

            Balakumaran.Gopal Balakumaran Gopal
            bweir bweir