Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-38106

XDCR Remote Cluster Service Cleanup

    XMLWordPrintable

Details

    Description

      There are a few suggestions for remote cluster service that should be addressed in 7.0

      1. syncInternalsFromStagedReference currently holds on to a Write lock while doing a RPC call to the remote node.
      2. Multiple RefreshContext's could occur at the same time on a busy system. Refresh() call should limit this from happening instead of letting multiple ones happen.

       

      Current Issues:

      1. Two levels of locking – service.agentMtx to get an agent, and then agent.refMtx as the second layer.
        Note that the level of locking isn’t necessarily the issue, but the first level lock is dependent upon how quickly the second level lock can return.
        AgentMtx WriteLock is held under the following situations:
      • Adding a remote cluster
      • Setting a remote cluster
      • Deleting a remote cluster
      1. When service.agentMtx WriteLock is held, no read can take place (i.e. unable to call GetCapability, etc).
        Only when the service.agentMtx Read-Lock is held, then an agent can be found.
        Once an agent is found, the agent.refMtx write lock can be held under the certain scenarios
      • Refresh() – (can be periodic or user induced)
      • SetRemoteCluster (either metakv boot up or as user induced action)
      • Starting a new agent (either as part of metakv boot up or user induced action)
      1. RPC calls are embedded within the second level locking mechanism. This means that the performance of system can be dependent upon the connectivity to a target node’s ns_server, which involves network conditions and also the node’s overall load and responsiveness.
      2. Refresh() is launched on a periodical basis via a ticker as well as a manual basis. There is no coordination between the automated effort and the manual effort. As such, it is possible for concurrent Refresh() to occur, and only one of the Refresh() instance win. It wastes system resources and induce potential and unnecessary lock contention.

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            neil.huang Neil Huang added a comment -

            System Behavior Changes:

             

            To address the above situations, the following will be the proposed system behavior.

             

            1. Adding a remote cluster (getOrStartNewAgent):
            2. When an add remote cluster action takes place, an agent is created and added to the service’s agent map. The service agent’s map is unlocked before the agent.Start() is called.
            3. The agent at this point is considered “not yet initialized”. Because the service agent’s map is unlocked, concurrent requests are possible:
            4. Any Set/Refresh operations on this agent will return an error.
            5. Any Delete/Stop operation on this agent is allowed.
            1. Note that the first-level lock is not held. This allows other agents to be accessed. Allowing agent to have a “not yet initialized” state is what allows the first-level lock to be unlocked and be safe compared to the previous design.
            1. The agent is then Start()’ed. Agent.Start() is called under two scenarios:
            2. UserInitiated - AddRemoteCluster() is called when user issues a command via REST.
            3. This function call will cause the agent.Start() to be synchronous. The agent will go ahead and do the RPC and populate the information. Any errors with RPC will be returned to the caller of agent.Start() and returned back to REST.
            4. If the whole operation is successful, the agent is now considered “Initialized”. Set/Refresh operations are now allowed.
            5. If the agent had issues writing the reference to metakv, the agent is removed from the service map and another add can take place.
            6. If the agent had any issues that is not related to metakv, the agent is still considered “initialized” so that Refresh() will be enabled, and Refresh() will repair any issues.
            1. Non-UserInitiated - When a node starts up and loads the reference from metakv.
            2. The Start() call will be asynchronous. This means that the agent.Start() will set up the agent’s framework and return, while executing the RPC calls in the background. This is because the agent must exist in the metakv boot-up case. Any errors with RPC will have to be re-tried with the periodic refresh mechanism. This is the same design as when the RCS has issues detecting user’s intent at boot-up… let the Refresh() take care of retrying RPC calls.
            3. Once the first initial RPC call initiated by Start() has finished, Set/Refresh ops will be allowed. Note that Delete() is always allowed (and should win) during Start() calls.
            1. Refresh()-ing a remote cluster:
            2. The first Refresh() call: As normal, periodically, a Refresh() call will be issued. The Refresh() will execute as normal but watch for abort signals.
            3. If a concurrent Refresh() is called while the first one is running, the second Refresh() call will simply wait until the first Refresh() is finished executing and return the error result from the first one.
            4. This is different from previous design where multiple refresh() could happen at the same time. It still allows concurrent Refresh() calls, but only one actual Refresh() is executed (with one RPC) and the result is distributed back to the callers. This should ensure consistent system behavior and allow the system to be more light-weight.
            1. The changes here is that a Refresh() call can be interrupted by an user action (such as a SetRemoteCluster). An interrupted Refresh() call will not be able to update the agent’s reference. The design decision is that a user’s set (i.e. updating the hostname) should take priority.
            2. Likewise, If a user’s SetRemoteCluster() is underway, Refresh() should be prevented to execute.
            1. SetRemoteCluster - A setRemoteCluster must win over any automatic Refresh()
            2. UserInitiated - SetRemoteCluster() call will try to Abort any ongoing Refresh() before actually running the RPC verification with the SetRemoteCluster() call. If a set call fails, the failure should be due to the actual set operation, and not because a Refresh() is ongoing.
            3. Non-UserInitiated - this is when metakv callback occurs. This call should always win over any Refresh() as well, and this path will ensure that AbortRefresh() is called before it tries to set.
            4. When a set is under way, any Refresh() must be denied.
            1. DelRemoteCluster - Deletion path should always win, even before Add() has finished initializing.
            neil.huang Neil Huang added a comment - System Behavior Changes:   To address the above situations, the following will be the proposed system behavior.   Adding a remote cluster (getOrStartNewAgent): When an add remote cluster action takes place, an agent is created and added to the service’s agent map. The service agent’s map is unlocked before the agent.Start() is called. The agent at this point is considered “not yet initialized”. Because the service agent’s map is unlocked, concurrent requests are possible: Any Set/Refresh operations on this agent will return an error. Any Delete/Stop operation on this agent is allowed. Note that the first-level lock is not held. This allows other agents to be accessed. Allowing agent to have a “not yet initialized” state is what allows the first-level lock to be unlocked and be safe compared to the previous design. The agent is then Start()’ed. Agent.Start() is called under two scenarios: UserInitiated - AddRemoteCluster() is called when user issues a command via REST. This function call will cause the agent.Start() to be synchronous. The agent will go ahead and do the RPC and populate the information. Any errors with RPC will be returned to the caller of agent.Start() and returned back to REST. If the whole operation is successful, the agent is now considered “Initialized”. Set/Refresh operations are now allowed. If the agent had issues writing the reference to metakv, the agent is removed from the service map and another add can take place. If the agent had any issues that is not related to metakv, the agent is still considered “initialized” so that Refresh() will be enabled, and Refresh() will repair any issues. Non-UserInitiated - When a node starts up and loads the reference from metakv. The Start() call will be asynchronous. This means that the agent.Start() will set up the agent’s framework and return, while executing the RPC calls in the background. This is because the agent must exist in the metakv boot-up case. Any errors with RPC will have to be re-tried with the periodic refresh mechanism. This is the same design as when the RCS has issues detecting user’s intent at boot-up… let the Refresh() take care of retrying RPC calls. Once the first initial RPC call initiated by Start() has finished, Set/Refresh ops will be allowed. Note that Delete() is always allowed (and should win) during Start() calls. Refresh()-ing a remote cluster: The first Refresh() call: As normal, periodically, a Refresh() call will be issued. The Refresh() will execute as normal but watch for abort signals. If a concurrent Refresh() is called while the first one is running, the second Refresh() call will simply wait until the first Refresh() is finished executing and return the error result from the first one. This is different from previous design where multiple refresh() could happen at the same time. It still allows concurrent Refresh() calls, but only one actual Refresh() is executed (with one RPC) and the result is distributed back to the callers. This should ensure consistent system behavior and allow the system to be more light-weight. The changes here is that a Refresh() call can be interrupted by an user action (such as a SetRemoteCluster). An interrupted Refresh() call will not be able to update the agent’s reference. The design decision is that a user’s set (i.e. updating the hostname) should take priority. Likewise, If a user’s SetRemoteCluster() is underway, Refresh() should be prevented to execute. SetRemoteCluster - A setRemoteCluster must win over any automatic Refresh() UserInitiated - SetRemoteCluster() call will try to Abort any ongoing Refresh() before actually running the RPC verification with the SetRemoteCluster() call. If a set call fails, the failure should be due to the actual set operation, and not because a Refresh() is ongoing. Non-UserInitiated - this is when metakv callback occurs. This call should always win over any Refresh() as well, and this path will ensure that AbortRefresh() is called before it tries to set. When a set is under way, any Refresh() must be denied. DelRemoteCluster - Deletion path should always win, even before Add() has finished initializing.

            Build couchbase-server-7.0.0-1614 contains goxdcr commit 3b9d94b with commit message:
            MB-38106 - Implement remoteClusterSvc asynchronous RPC to remote target cluster without holding onto locks

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-1614 contains goxdcr commit 3b9d94b with commit message: MB-38106 - Implement remoteClusterSvc asynchronous RPC to remote target cluster without holding onto locks
            wayne Wayne Siu added a comment -

            Re-opening it for 6.6.6 (6.6.5-MP8).

            wayne Wayne Siu added a comment - Re-opening it for 6.6.6 (6.6.5-MP8).
            neil.huang Neil Huang added a comment -

            Wayne Siu - why wouldn't there be a BP issue specifically targeted for the MP? That's usually the process, and this MB remains closed.

            neil.huang Neil Huang added a comment - Wayne Siu - why wouldn't there be a BP issue specifically targeted for the MP? That's usually the process, and this MB remains closed.
            ritam.sharma Ritam Sharma added a comment - - edited

            Neil Huang - This ticket has been closed earlier as 'request-dev-verify', please help with validation for MP8 as well.
            QE will make sure that regression is green.

            ritam.sharma Ritam Sharma added a comment - - edited Neil Huang - This ticket has been closed earlier as 'request-dev-verify', please help with validation for MP8 as well. QE will make sure that regression is green.
            neil.huang Neil Huang added a comment -

            Changes have been verified on dev environment using vagrant VMs for testing, rolling upgrade from 6.6.5 to 6.6.5+changeset, as well as rebalancing in+out target nodes to ensure remote cluster service refreshes and updates target node list as expected.

            Changes have been checked into mad-hatter branch. The ticket will be marked closed once the build is completed.

            neil.huang Neil Huang added a comment - Changes have been verified on dev environment using vagrant VMs for testing, rolling upgrade from 6.6.5 to 6.6.5+changeset, as well as rebalancing in+out target nodes to ensure remote cluster service refreshes and updates target node list as expected. Changes have been checked into mad-hatter branch. The ticket will be marked closed once the build is completed.

            Build couchbase-server-6.6.5-10113 contains goxdcr commit 473eb59 with commit message:
            MB-38106 - Implement remoteClusterSvc asynchronous RPC to remote target cluster without holding onto locks

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.5-10113 contains goxdcr commit 473eb59 with commit message: MB-38106 - Implement remoteClusterSvc asynchronous RPC to remote target cluster without holding onto locks

            Build couchbase-server-6.6.6-10503 contains goxdcr commit 473eb59 with commit message:
            MB-38106 - Implement remoteClusterSvc asynchronous RPC to remote target cluster without holding onto locks

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.6-10503 contains goxdcr commit 473eb59 with commit message: MB-38106 - Implement remoteClusterSvc asynchronous RPC to remote target cluster without holding onto locks

            People

              neil.huang Neil Huang
              neil.huang Neil Huang
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                PagerDuty