Delta node recovery after failover: a failed node should be able to catch up instead of being considered a new node

Details

    • Improvement
    • Resolution: Fixed
    • Major
    • 3.0
    • 2.5.0
    • ns_server
    • Security Level: Public
    • all Couchbase servers
    • Sprint 1 - Jan27 - Feb14

    Description

      When a node fails due to an unexpected error and goes into a PEND state, either auto failover or manual failover causes all the 1st replicas for vBuckets managed on that node to be promoted to active. At this point, the failed node is typically "removed", another node is "added", and the user rebalances the cluster to recreate the missing replicas and to populate the active vBuckets on the newly added node.
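The promotion step described above can be illustrated with a minimal sketch. It is not ns_server code: the map layout (a vBucket id mapped to a chain of node names, active copy first) and the function name `fail_over` are assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical vBucket map: vBucket id -> chain of node names,
# where chain[0] holds the active copy and chain[1:] hold replicas.
VBucketMap = Dict[int, List[str]]


def fail_over(vbucket_map: VBucketMap, failed_node: str) -> VBucketMap:
    """Return a new map with failed_node dropped from every chain.

    When the failed node was at the front of a chain, dropping it leaves
    the former 1st replica in position 0, which models its promotion to
    active. Chains come out one copy short until a rebalance recreates
    the missing replicas.
    """
    return {
        vb: [node for node in chain if node != failed_node]
        for vb, chain in vbucket_map.items()
    }


before = {0: ["A", "B"], 1: ["B", "A"], 2: ["C", "A"]}
after = fail_over(before, "A")
```

In this sketch, vBucket 0 was active on node A, so node B becomes its active copy; vBuckets 1 and 2 only lose a replica.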

      If the failed node is first "removed", followed by a rebalance, and then "added" back in, it is treated as a completely new node and all of its prior data is removed instead of potentially being reused.

      This becomes a huge issue when the amount of data a node manages is very large. Transferring data back and forth even when it has barely been updated is extremely inefficient. If failover is not used and the node is brought back in without failing over, the data is unavailable in the meantime, so that is not an option in most cases.

      We need to improve this scenario so that rebalance can catch the failed node up from where it left off, using the new active as the source, as opposed to transferring all the data over from scratch.
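A minimal sketch of the catch-up idea follows. It assumes that each mutation in a vBucket carries a monotonically increasing sequence number and that the failed node's data is a strict prefix of the new active's history; neither assumption is stated in this ticket, and the names `mutations_to_send` and `catch_up` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical mutation record: (sequence number, key, value).
Mutation = Tuple[int, str, str]


def mutations_to_send(active_log: List[Mutation], node_last_seqno: int) -> List[Mutation]:
    """Select only the mutations the returning node has not seen yet.

    active_log is assumed to be in ascending sequence-number order.
    """
    return [m for m in active_log if m[0] > node_last_seqno]


def catch_up(
    node_data: Dict[str, str],
    node_last_seqno: int,
    active_log: List[Mutation],
) -> Tuple[Dict[str, str], int]:
    """Apply the missing mutations on top of the node's existing data."""
    data = dict(node_data)  # keep the data already on the node
    last_seqno = node_last_seqno
    for seqno, key, value in mutations_to_send(active_log, node_last_seqno):
        data[key] = value
        last_seqno = seqno
    return data, last_seqno


# The node failed after seqno 2; the new active has since taken one more write.
log = [(1, "a", "1"), (2, "b", "2"), (3, "a", "3")]
delta = mutations_to_send(log, 2)   # delta recovery: 1 mutation
full = mutations_to_send(log, 0)    # recovery from scratch: 3 mutations
recovered, seqno = catch_up({"a": "1", "b": "2"}, 2, log)
```

The sketch ignores the case where the failed node holds writes that the new active never received; a real implementation would need to detect and discard those before catching up.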

      The failed node shouldn't have to be "removed" and "added" back in. There should be an option for the failed node to be rebalanced back into the cluster by taking it out of the pending remove state.

            People

              Parag Agarwal (Inactive)
              Cihan Biyikoglu (Inactive)