Couchbase Server - MB-25918

RemoteClusterService Improvement Redesign



    • Type: Improvement
    • Resolution: Fixed
    • Priority: Critical
    • Labels: None
    • Fix Version: 5.5.0
    • Component: XDCR
    • Environment: None


      The following MBs are currently filed with regard to RemoteClusterService:


      This MB resolves the issues above via a newer and simpler design that avoids the known common pitfalls.


      Remote Cluster Service holds a locally cached copy of the remote cluster reference, and keeps any modification synchronized with metakv. We first take a look at the use cases:


      Write use cases (We define write use cases as any operation that causes a change locally and then attempts to update metakv):

      Remote Cluster Service exposes these basic write operations via UI/CLI/REST:

      1. Adding a remote cluster reference - (addRemoteCluster)

      2. Deleting a remote cluster reference - (DelRemoteCluster)

      3. Modifying a remote cluster reference - (setRemoteCluster)


      Internally, Remote Cluster Service may be modified by a refresh() operation. refresh() is called in the following use cases and can potentially perform a write operation on metakv.

      1. Periodic refreshing from Replication Manager
        Each remote cluster reference is expected to be refreshed whether or not it is active.
      2. When creating a new ReplicationSpec - calls refresh()
        RemoteClusterByRefName <- ValidateNewReplicationSpec (could be run in parallel) <- createAndPersistReplicationSpec <- CreateReplication
      3. When a new pipeline is constructed (XDCRFactory’s NewPipeline) - checks live remote reference status by doing a refresh
        RemoteClusterByUuid <- NewPipeline
      4. When upgrading from Couchbase < v5.0 to v5.0 or above:
        upgradeRemoteClusterRefs() is called before any replication is started (as Replication Manager starts up).


      In other words, from a single node’s perspective, the operations other than the periodic refresh rarely happen concurrently, so we can serialize them.


      To solve this, we propose a concept similar to the pipeline updater, called the RemoteCluster Agent. Every present remote cluster reference is associated with an agent, and the agent remains in scope until the reference is removed.


      The remote cluster service, in essence, will have a list of agents, each agent serving the original purpose of a cache: giving multiple readers concurrent access while serializing modifications of the cached data. We can simplify the original design by not having to worry about CAS, since an RWMutex controls each agent.


      The agent itself will also be responsible for updating metakv. Since each agent is responsible for a single reference on a single node, each agent is able to serialize concurrent write operations to its cache, as well as serialize its updates of metakv. The next segment will discuss coordinating multiple agents concurrently updating a single metakv in a distributed system environment.


      RemoteCluster Agent behavior overview:


       Sample proposed RemoteClusterAgent struct with the minimum members needed to replace a cache:

      type RemoteClusterAgent struct {
          reference    metadata.RemoteClusterReference
          refNodesList []string
          refMtx       sync.RWMutex
      }

      AddRemoteCluster operation:

      1. Create an agent
      2. The agent starts
      3. Start its self-monitoring
      4. Start the metakvAgent (responsible for updating metakv) - serializes updates at the cluster level
      5. Realize that this start is NOT from a metakv update
      6. Write the reference to metakv via the metakvAgent

      <Stable State>



      Non-active node (AddRemoteCluster):

      1. metakv callback
      2. Realize that a new remote cluster reference has been created
      3. Create an agent from the reference
      4. The agent starts
      5. Start self-monitoring
      6. Start the metakv service
      7. Realize that this start is from a metakv update; do not re-write it to metakv

      <Stable State>
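The add flows above, on both the active and the non-active node, can be sketched as follows. The types and names here (Reference, metakvStore, the fromMetakv flag) are illustrative assumptions, not the actual goxdcr implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// Reference is a stand-in for metadata.RemoteClusterReference.
type Reference struct {
	Id       string
	Hostname string
}

// metakvStore stands in for the real metakv service.
type metakvStore struct {
	mtx  sync.Mutex
	data map[string]Reference
}

func (m *metakvStore) set(ref Reference) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.data[ref.Id] = ref
}

type RemoteClusterAgent struct {
	reference Reference
	refMtx    sync.RWMutex
	store     *metakvStore
}

// Start caches the reference. Only a start triggered locally
// (fromMetakv == false) writes the reference back to metakv; a start
// triggered by the metakv callback must not re-write it (step 7 of the
// non-active flow above).
func (a *RemoteClusterAgent) Start(ref Reference, fromMetakv bool) {
	a.refMtx.Lock()
	a.reference = ref
	a.refMtx.Unlock()
	if !fromMetakv {
		a.store.set(ref) // active node: persist the new reference
	}
}

func main() {
	store := &metakvStore{data: make(map[string]Reference)}
	active := &RemoteClusterAgent{store: store}
	active.Start(Reference{Id: "ref1", Hostname: "c2.example.com"}, false)
	fmt.Println(len(store.data)) // 1
}
```

The single boolean distinguishing the two start paths is what prevents the echo problem: a metakv-triggered start never writes back to metakv, so the callback chain terminates.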



      Active SetRemoteCluster operation:

      1. Get the agent with the reference ID.
      2. WLock
      3. Update the agent’s internal reference
      4. Update metakv
      5. WUnlock
      6. service.metadata_change_callback(refId, oldRef, newRef) -> calls metakv_change_listener.go’s remoteClusterChangeHandlerCallback


      Passive SetRemoteCluster:

      1. metakv callback (RemoteClusterServiceCallback)
      2. Finds agent for the cluster.
      3. WLock
      4. Update agent’s internal reference
      5. WUnlock
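The two SetRemoteCluster paths above might look like the following sketch; the type and method names are assumptions for illustration only:

```go
package main

import (
	"fmt"
	"sync"
)

// Reference is a stand-in for metadata.RemoteClusterReference.
type Reference struct {
	Id       string
	Hostname string
}

// metakvStore stands in for the real metakv service.
type metakvStore struct {
	mtx  sync.Mutex
	data map[string]Reference
}

func (m *metakvStore) set(ref Reference) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.data[ref.Id] = ref
}

type RemoteClusterAgent struct {
	reference Reference
	refMtx    sync.RWMutex
	store     *metakvStore
}

// setReference is the active path: update the cached reference and metakv
// under one write lock so readers never observe a half-applied change.
func (a *RemoteClusterAgent) setReference(newRef Reference) {
	a.refMtx.Lock()
	defer a.refMtx.Unlock()
	a.reference = newRef
	a.store.set(newRef) // step 4: update metakv while still holding WLock
}

// updateFromMetakv is the passive path: the metakv callback only updates
// the cache and never writes back to metakv.
func (a *RemoteClusterAgent) updateFromMetakv(newRef Reference) {
	a.refMtx.Lock()
	defer a.refMtx.Unlock()
	a.reference = newRef
}

func main() {
	store := &metakvStore{data: make(map[string]Reference)}
	agent := &RemoteClusterAgent{store: store}
	agent.setReference(Reference{Id: "ref1", Hostname: "new.example.com"})
	fmt.Println(store.data["ref1"].Hostname) // new.example.com
}
```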


      Active DelRemoteCluster operation:

      1. Find Agent
      2. WLock
      3. Delete reference from metakv
      4. Stop Agent
      5. WUnlock
      6. Delete Agent


      Non-active DelRemoteCluster:

      1. metakv callback (RemoteClusterServiceCallback)
      2. Find Agent
      3. WLock
      4. StopAgent
      5. WUnlock
      6. Delete Agent
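The active deletion flow above can be sketched as follows; the service and callback names are illustrative assumptions. The non-active flow is identical except that the metakv delete has already happened and is observed via the callback:

```go
package main

import (
	"fmt"
	"sync"
)

type RemoteClusterAgent struct {
	refMtx  sync.RWMutex
	stopped bool
}

func (a *RemoteClusterAgent) Stop() { a.stopped = true }

type RemoteClusterService struct {
	mtx    sync.Mutex
	agents map[string]*RemoteClusterAgent
	// metakvDelete stands in for the real metakv delete call.
	metakvDelete func(refId string)
}

// DelRemoteCluster implements the active deletion flow (steps 1-6):
// find the agent, then delete from metakv and stop the agent while
// holding the write lock, and finally remove the agent from the map.
func (s *RemoteClusterService) DelRemoteCluster(refId string) {
	s.mtx.Lock()
	agent, ok := s.agents[refId]
	s.mtx.Unlock()
	if !ok {
		return
	}
	agent.refMtx.Lock()
	s.metakvDelete(refId) // step 3: delete reference from metakv
	agent.Stop()          // step 4: stop the agent under WLock
	agent.refMtx.Unlock()
	s.mtx.Lock()
	delete(s.agents, refId) // step 6: delete the agent
	s.mtx.Unlock()
}

func main() {
	svc := &RemoteClusterService{
		agents:       map[string]*RemoteClusterAgent{"ref1": {}},
		metakvDelete: func(refId string) {},
	}
	svc.DelRemoteCluster("ref1")
	fmt.Println(len(svc.agents)) // 0
}
```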


      Any operation without refresh():

      1. Get Agent from RemoteClusterService
      2. agent.getRef()
      3. RLock
      4. Clone
      5. RUnlock
      6. Return cloned reference
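The read path above is what makes the RWMutex design work: readers take the RLock, clone, and return the clone so callers can never mutate the agent's cached copy. A minimal sketch, with illustrative names:

```go
package main

import (
	"fmt"
	"sync"
)

// Reference is a stand-in for metadata.RemoteClusterReference.
type Reference struct {
	Id       string
	Hostname string
	Nodes    []string
}

// Clone deep-copies the reference, including the node slice, so the
// returned value does not alias the agent's cached data.
func (r Reference) Clone() Reference {
	c := r
	c.Nodes = append([]string(nil), r.Nodes...)
	return c
}

type RemoteClusterAgent struct {
	reference Reference
	refMtx    sync.RWMutex
}

// getRef allows many concurrent readers; only writers take the WLock.
func (a *RemoteClusterAgent) getRef() Reference {
	a.refMtx.RLock()
	defer a.refMtx.RUnlock()
	return a.reference.Clone()
}

func main() {
	agent := &RemoteClusterAgent{reference: Reference{Id: "ref1", Nodes: []string{"n1", "n2"}}}
	ref := agent.getRef()
	ref.Nodes[0] = "mutated"
	fmt.Println(agent.reference.Nodes[0]) // n1
}
```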


      Any operation with refresh() (e.g. RemoteClusterByRefName()):

      Without update:

      1. Get Agent from RemoteClusterService
      2. agent.refresh()
      3. WLock
      4. Go through refresh code
      5. No update needed
      6. Clone
      7. WUnlock
      8. Return cloned reference


      With update:

      1. Get Agent
      2. agent.refresh()
      3. WLock
      4. Go through refresh code
      5. Figure out new Node List
      6. If hostname is not in target - go through replaceRefHostName main section
      7. Update metakv with brand new reference
      8. WUnlock
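The refresh() flows above, with and without an update, can be sketched in one function; all names here are assumptions, and a simple slice stands in for the metakv write:

```go
package main

import (
	"fmt"
	"sync"
)

// Reference is a stand-in for metadata.RemoteClusterReference.
type Reference struct {
	Id       string
	Hostname string
}

type RemoteClusterAgent struct {
	reference    Reference
	refNodesList []string
	refMtx       sync.RWMutex
	persisted    []Reference // stand-in for metakv writes
}

// refresh updates the cached node list under the write lock. If the
// reference's hostname is no longer in the target cluster, it replaces
// the hostname (the replaceRefHostName main section) and persists the
// brand new reference; otherwise no metakv update is needed.
func (a *RemoteClusterAgent) refresh(liveNodes []string) {
	a.refMtx.Lock()
	defer a.refMtx.Unlock()
	a.refNodesList = liveNodes // figure out the new node list
	for _, n := range liveNodes {
		if n == a.reference.Hostname {
			return // hostname still valid: refresh without update
		}
	}
	if len(liveNodes) > 0 {
		a.reference.Hostname = liveNodes[0]          // replaceRefHostName
		a.persisted = append(a.persisted, a.reference) // update metakv
	}
}

func main() {
	agent := &RemoteClusterAgent{reference: Reference{Id: "ref1", Hostname: "gone.example.com"}}
	agent.refresh([]string{"n1.example.com", "n2.example.com"})
	fmt.Println(agent.reference.Hostname) // n1.example.com
}
```

Note that because the whole refresh runs under the WLock, a concurrent getRef() either sees the old hostname or the fully replaced one, never an intermediate state.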


      Any further cache-related reads will be done purely at a per-agent level.


      09/18/17 - Per internal discussion, the scope of this work will not include minimizing the metakv update collisions described below: with the current method of picking nodes in an orderly fashion, a collision merely causes a small spike of metakv work to check for conflicts (of which there will be none) before resuming operation.

      Multi-nodes race conditions when updating metaKV


      Having an agent is a viable solution for the race conditions that may happen within a single node, but we should also examine the scenario where an XDCR setup has more than one node sharing the same Remote Cluster Reference (via metakv), all doing periodic refreshes.


      As of now, each node does its own refresh. In other words, each node has a “cache” (or agent, in this new scheme) that is independent from the others. Each cache/agent holds a list of the nodes that belong to the remote cluster. On each refresh, should a node in the list change, the modification is made only locally.


      We have had an issue in the past (CBSE-3882) where many nodes in a single cluster all tried to update metakv with different values, leading to conflict and ping-pong (MB-25004).

      This issue shows up when the persisted hostname is unreachable, then the RemoteClusterService will go through its cached list of nodes and attempt to replace the host in the reference with one of the nodes in its cached list.

      This operation is perfectly fine in a one-node system. However, in a multi-node cluster, each node has its own cached list of the nodes, potentially in a different order, so each node may pick a different replacement node and attempt to write a different hostname to the reference in metakv, leading to conflict and potential ping-pong.


      Another issue this scenario shows is that we are essentially duplicating a simple operation on a single document. As the number of nodes in a system scales, the more likely we are to have multiple writers, leading to conflicts.


      The problem we need to solve is minimizing metakv update collisions. The solution needs to:

      1. Have one node be the leader in updating metakv with a new hostname.
      2. Coordinate “leader selection” without needing inter-node communication.


      To solve this, we propose the following solution.

      First, we must decouple the current refresh operation. Right now, it performs the following in one pass. We will separate these into two different components.

      1. Go through the cache/agent’s list of nodes, and make sure that the node list is up to date, so that all other operations that depend on the cache/agent will be valid.
      2. Update the actual, singular copy of Remote Cluster Reference with a valid hostname, should the reference’s hostname become out of date.


      The proposal is that the agent’s refresh() operation perform only #1 above, as it is more vital to the immediate and correct operation of RemoteClusterService.

      Moreover, each node can perform #1 independently without interference from the others. As long as a node can perform #1, it can continue operating.


      Updating the singular copy of the reference will be done in a separate procedure. This procedure needs to be coordinated among all the nodes so that we achieve a single writer. Let’s call this procedure updatekv(), broken out of refresh().


      Updatekv() procedure:

      1. We order the nodes into a priority list. If an XDCR cluster has 4 nodes, then each node knows itself as one of the nodes at position {0, 1, 2, 3}.
        1. This is doable without inter-node XDCR communication by having each individual node do the following:
        2. Retrieve the list of nodes currently in the cluster from any node of the cluster.
        3. Since we cannot assume that the node list will be ordered identically for all nodes in the cluster, we sort the list by:
          1. Running “GetNodeListWithMinInfo”
          2. Get a list of node names by “GetNodeNameListFromNodeList”
          3. Sort the list by alphabetical order.
        4. Each node would find itself in the list, and the index would be its priority for updating the metakv.
      2. Node with priority 0 would be responsible for updating the metakv. If Node 0 fails to update the metakv (i.e. dies), Node 1 would take over the update, and down the list it goes.
        This operation must be done without inter-node communication, and the following algorithm should provide a solution to this. The following assumes that the hostname in the reference from metakv is found to be out of date and needs replacing.
        1. Node 0’s job is to launch the updatekv() operation: find a viable node that can replace the hostname in the reference in metakv. It performs the function of replaceRefHostName() immediately.
        2. Node 1’s job is to be first in line should node 0 fail to update metakv. It does so by waiting 1*base.RefreshRemoteClusterRefInterval seconds. If within this time frame it hears back from the metakv listener that the reference has been updated, it cancels its updatekv() operation. Otherwise, once the time has expired, it assumes that node 0 has failed to update metakv, takes over the leader role, and performs the update.
        3. Node 2’s job is to be the second in line should node 0 and node 1 both fail. It does so by waiting 2*base.RefreshRemoteClusterRefInterval seconds. Similar to node 1.
        4. Etc for lower priority nodes.


      The reasons for this design are mainly to:

      1. Keep the current system intact without introducing new inter-node messaging systems
      2. Minimize concurrent update to metakv to prevent ping-pong problems.
      3. Prevent Byzantine Failure in deciding who should update metakv, without implementing a new consensus messaging system to bring nodes to order.


      We know from the use cases that updates to metakv happen rarely, since an update is either:

      1. An action the user executes that updates metakv (i.e. deleting or editing a reference)
      2. A hostname in the reference being moved out of the current cluster.

      When a user executes an action, the specified node is responsible for updating metakv.

      If scenario 2 occurs, then with a commonly agreed list of node priorities, the system can avoid the ping-pong problem.

      Moreover, when scenario 2 occurs, each node’s Remote Cluster Service agent will do its duty of updating its cached list of nodes to ensure functionality. The update of metakv can occur in the background and is not essential to the operation of XDCR. With this design, we ensure that the update will happen eventually without the risk of concurrent writes within the cluster, which is the end goal we are trying to solve.






              Assignee: Neil Huang
              Reporter: Neil Huang