There are a few suggestions for the remote cluster service that should be addressed in 7.0:
- syncInternalsFromStagedReference currently holds a write lock while making an RPC call to the remote node.
- Multiple RefreshContexts can be active at the same time on a busy system. Refresh() should prevent this instead of letting concurrent refreshes proceed.
- Two levels of locking – service.agentMtx to get an agent, and then agent.refMtx as the second layer.
Note that having two levels of locking isn’t necessarily the issue; the problem is that how quickly the first-level lock can be released depends on how quickly the second-level lock is released.
The service.agentMtx write lock is held in the following situations:
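The first suggestion above (not holding the write lock across the RPC) could be approached roughly as follows. This is only a sketch under assumptions: the `agent` struct, its `version` field, `fetchFromRemote`, and `syncFromStaged` are hypothetical names used for illustration and are not the existing code. The idea is to snapshot a version under a read lock, do the RPC with no lock held, then commit under a short write lock only if nothing changed in the meantime.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// agent is a hypothetical stand-in for a remote cluster agent.
type agent struct {
	refMtx    sync.RWMutex
	reference string // placeholder for the committed reference
	version   uint64 // bumped on every committed change (assumed field)
}

// fetchFromRemote is a placeholder for the RPC to the target node.
func fetchFromRemote(staged string) (string, error) {
	time.Sleep(10 * time.Millisecond) // simulate network latency
	return staged + "+synced", nil
}

// syncFromStaged performs the RPC without holding refMtx, then commits
// under a short write lock only if no other update landed in between.
// It reports whether the result was committed.
func (a *agent) syncFromStaged(staged string) (bool, error) {
	a.refMtx.RLock()
	startVersion := a.version
	a.refMtx.RUnlock()

	result, err := fetchFromRemote(staged) // no lock is held here
	if err != nil {
		return false, err
	}

	a.refMtx.Lock()
	defer a.refMtx.Unlock()
	if a.version != startVersion {
		return false, nil // a concurrent update won; discard our result
	}
	a.reference = result
	a.version++
	return true, nil
}

func main() {
	a := &agent{reference: "initial"}
	committed, err := a.syncFromStaged("staged")
	fmt.Println(committed, err, a.reference)
}
```

With this shape, readers such as GetCapability would only ever wait for the brief commit step, not for the network round trip. Whether a discarded result should trigger a retry is a design decision left open here.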
- Adding a remote cluster
- Setting a remote cluster
- Deleting a remote cluster
While the service.agentMtx write lock is held, no read can take place (e.g. GetCapability cannot be called).
An agent can be found only while the service.agentMtx read lock is held.
Once an agent is found, the agent.refMtx write lock can be held under the following scenarios:
- Refresh() – (can be periodic or user induced)
- SetRemoteCluster (either metakv boot up or as user induced action)
- Starting a new agent (either as part of metakv boot up or user induced action)
- RPC calls are embedded within the second-level lock. This means that the performance of the system can depend on connectivity to a target node’s ns_server, which in turn depends on network conditions and on the node’s overall load and responsiveness.
- Refresh() is launched periodically via a ticker as well as manually. There is no coordination between the automated and manual invocations. As a result, concurrent Refresh() calls can occur, and only one of the Refresh() instances wins. This wastes system resources and introduces unnecessary lock contention.