Couchbase Documentation / DOC-8449

FTS - Improve rebalance/failover documentation for the search service.


Details


    Description

      Need to give more context about the service internals/impact during rebalance:

      - how partitions are handled during a rebalance
      - any impact on live traffic/updates
      - tips for faster rebalance operations
      - how to enable quick recovery of partitions during failover, upgrade, or recovery of a node
      - how to deal with rebalance failures
      - is it safe/recommended to retry a failed rebalance?
      - things to keep in mind before doing a rebalance
      - things to keep track of during a rebalance


        Activity

          Sreekanth Sivasankaran added a comment (edited)

          Rebalance => The Search Service maintains a cluster-wide set of index definitions and metadata, which allows indexes and index replicas to be redistributed during a rebalance. During a rebalance operation, the Search Service redistributes the index partitions across the available Search Service nodes to reach a balanced partition-to-node assignment.

          The newly assigned index partitions are built afresh from the DCP feed on the new nodes. Once the new partitions are built up to the current/latest sequence numbers, they are promoted to take the live traffic, and the older partitions are deleted from the system. Live traffic is never functionally affected; nevertheless, the performance impact of a concurrent rebalance on a live cluster can't be fully ruled out.

          As rebalance is a resource-consuming cluster management operation, it's always recommended to perform it during off-peak hours.
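          One way to keep track of a rebalance from the command line is the Cluster Manager's rebalance-progress endpoint (a sketch; note this uses the Cluster Manager port 8091 rather than the Search port 8094, and the node address and credentials are placeholders):

```shell
# Poll overall rebalance progress from the Cluster Manager (port 8091,
# not the Search port 8094). <nodeIP> and the credentials are placeholders.
curl -s -XGET -uAdministrator:password \
  http://<nodeIP>:8091/pools/default/rebalanceProgress
```

          While a rebalance is running, the response carries a per-node progress fraction; when none is running, it reports a status of "none".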

           

          How to speed up a rebalance / tips for a faster rebalance =>

          The Search Service moves or builds index partitions one at a time per node during a rebalance. This can significantly increase the overall time taken by the rebalance operation.

          One way to speed up a rebalance is to allow partitions to be moved in parallel, in a configurable way.

          The configurable option [maxConcurrentPartitionMovesPerNode] adds concurrency to the way partitions are moved/built during a rebalance operation.

          If this parameter is overridden (maxConcurrentPartitionMovesPerNode set to N) as a runtime cluster option, then up to N partitions can be built in parallel per node at a time, and the rebalance ought to complete faster.

           

          How to configure `maxConcurrentPartitionMovesPerNode` in a cluster in CC?

          Use the update endpoint for manager options:

           

          curl -XPUT -uAdministrator:asdasd http://<nodeIP>:8094/api/managerOptions -d '{"maxConcurrentPartitionMovesPerNode":"5"}'

            

          How to check the current value for `maxConcurrentPartitionMovesPerNode` in a cluster?

           

          curl -XGET -uAdministrator:asdasd http://<nodeIP>:8094/api/manager

           

          Please keep in mind that when multiple partitions are built in parallel, more RAM is needed, which mandates a higher RAM quota.

          As rebalance consumes resources, it is always advisable to plan rebalance operations during off-peak hours.
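          If a higher quota is needed, the Search Service memory quota can be raised through the Cluster Manager REST API (a sketch; the 2048 MiB value and the credentials are placeholders, and `ftsMemoryQuota` is the Cluster Manager setting for the Search Service quota in MiB):

```shell
# Raise the Search Service (FTS) memory quota to 2048 MiB via the Cluster
# Manager (port 8091). The value and credentials are placeholders.
curl -XPOST -uAdministrator:password \
  http://<nodeIP>:8091/pools/default \
  -d 'ftsMemoryQuota=2048'
```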

           

           

          Failovers => During failover of Search Service nodes there is no partition movement, so the failover-rebalance is instantaneous. The Search Service promotes replica index partitions (where replicas are configured) to primary, so that they serve the live cluster traffic instantly.

          Failover followed by a recovery rebalance can be used for applying patches or upgrading software/hardware over a short window. During a strict recovery rebalance operation (no extra node additions/removals), the index partitions residing on the recovered node are reused, which ensures a quick recovery rebalance.

          So the usual failover-recovery steps are:

          1. Fail over the node that needs quick software or hardware maintenance.
          2. With replica partitions, live traffic is served seamlessly.
          3. Perform the software/hardware maintenance.
          4. Perform the recovery rebalance operation.
          5. The cluster is back to its normal/pre-failover state.
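          As a sketch, the steps above could be driven with couchbase-cli (the cluster address, node name, and credentials are placeholders, and hard failover is shown as an assumption; pick the failover and recovery types appropriate for your deployment):

```shell
# Hypothetical failover-and-recovery sequence using couchbase-cli.
# <clusterIP>, <nodeIP>, and the credentials are placeholders.

# 1. Fail over the node that needs maintenance.
couchbase-cli failover -c <clusterIP>:8091 -u Administrator -p password \
  --server-failover <nodeIP>:8091 --hard

# 2-3. Replica partitions serve live traffic while the maintenance is done.

# 4. Mark the node for recovery, then rebalance it back into the cluster.
couchbase-cli recovery -c <clusterIP>:8091 -u Administrator -p password \
  --server-recovery <nodeIP>:8091 --recovery-type full
couchbase-cli rebalance -c <clusterIP>:8091 -u Administrator -p password
```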

           

           

           

           


          People

            amarantha.kulkarni Amarantha Kulkarni (Inactive)
            Sreekanth Sivasankaran

