Details

Type: Improvement
Resolution: Fixed
Priority: Major
Fix Version/s: 6.6.6, 7.0.0
Affects Version/s: Cheshire-Cat
Component/s: XDCR
Labels:
- approved-for-6.6.6
- request-dev-verify

Description

There are a few suggestions for the remote cluster service that should be addressed in 7.0:

syncInternalsFromStagedReference currently holds a write lock while making an RPC call to the remote node.
On a busy system, multiple RefreshContexts can run at the same time. Refresh() should prevent this instead of allowing concurrent instances.
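
To illustrate the first suggestion, here is a minimal sketch of one way to keep the RPC out of the write-locked section: snapshot the inputs under a read lock, make the call with no lock held, then take the write lock only to commit. The agent type, its fields other than refMtx, the fetchNodes helper, and the version check are all hypothetical and are not taken from the goxdcr source.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// agent is a hypothetical stand-in for a remote cluster agent. Only the
// refMtx name comes from the ticket; everything else is illustrative.
type agent struct {
	refMtx   sync.RWMutex
	hostname string   // node to contact
	nodes    []string // last committed node list
	version  uint64   // bumped on every committed change
}

// fetchNodes simulates a slow RPC to the remote node (assumed helper).
func fetchNodes(hostname string) []string {
	time.Sleep(50 * time.Millisecond)
	return []string{hostname, "node2.example.com"}
}

// refresh snapshots what the RPC needs under a read lock, performs the
// RPC with no lock held, and takes the write lock only to commit.
func (a *agent) refresh() bool {
	a.refMtx.RLock()
	host, seen := a.hostname, a.version
	a.refMtx.RUnlock()

	nodes := fetchNodes(host) // slow path, no lock held

	a.refMtx.Lock()
	defer a.refMtx.Unlock()
	if a.version != seen {
		return false // reference changed mid-flight, discard the stale result
	}
	a.nodes = nodes
	a.version++
	return true
}

// nodeList is a reader that needs only the read lock.
func (a *agent) nodeList() []string {
	a.refMtx.RLock()
	defer a.refMtx.RUnlock()
	return append([]string(nil), a.nodes...)
}

func main() {
	a := &agent{hostname: "node1.example.com"}
	done := make(chan bool)
	go func() { done <- a.refresh() }()

	time.Sleep(10 * time.Millisecond) // refresh is now inside the simulated RPC
	start := time.Now()
	_ = a.nodeList() // not blocked by the in-flight RPC
	fmt.Println("read waited for", time.Since(start))
	fmt.Println("committed:", <-done, "nodes:", a.nodeList())
}
```

The trade-off is that the result may be stale by the time it is committed, which is why this sketch re-checks a version counter before writing. The actual fix may handle that differently.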

Current Issues:

Two levels of locking: service.agentMtx to get an agent, and then agent.refMtx as the second layer.
Note that the number of locking levels isn't necessarily the issue; the problem is that the first-level lock depends on how quickly the second-level lock can return.
The service.agentMtx write lock is held in the following situations:

Adding a remote cluster
Setting a remote cluster
Deleting a remote cluster

While the service.agentMtx write lock is held, no read can take place (e.g. GetCapability cannot be called).
An agent can be found only while the service.agentMtx read lock is held.
Once an agent is found, the agent.refMtx write lock can be held in the following scenarios:

Refresh() (periodic or user-induced)
SetRemoteCluster (either at metakv boot-up or as a user-induced action)
Starting a new agent (either as part of metakv boot-up or as a user-induced action)

RPC calls are embedded within the second-level locking mechanism. This means that the performance of the system can depend on connectivity to a target node's ns_server, which involves network conditions as well as that node's overall load and responsiveness.
Refresh() is launched periodically via a ticker and can also be triggered manually. There is no coordination between the automated and the manual path. As a result, concurrent Refresh() calls can occur, and only one of the Refresh() instances wins. This wastes system resources and induces unnecessary lock contention.

Issue Links

blocks

MB-37885 RemoteClusterSvc to quickly handle ReplicationSpecSvc requests

Closed

is triggering

MB-38672 Remote Cluster Service Refresh stuck

Closed

MB-54885 [BP 6.6.6] - Remote Cluster Service Refresh stuck

Closed

MB-38668 Return a valid error message to UI when initial refresh isn't finished

Closed

People

Assignee: Neil Huang

Reporter: Neil Huang

Votes: 0

Watchers: 5

Dates

Due: 05/Oct/22

Created: 27/Feb/20 2:27 PM

Updated: 15/Dec/22 1:45 PM

Resolved: 05/Oct/22 5:02 PM

Gerrit Reviews

There are no open Gerrit changes

There are 3 closed Gerrit changes

MB-38106 - Implement remoteClusterSvc asynchronous RPC to remote target cluster without holding onto locks - Fixed up some management synchronization due to the relaxation of locking: Gerrit Review:

MB-38106 - Implement remoteClusterSvc asynchronous RPC to remote target cluster without holding onto locks - Fixed up some management synchronization due to the relaxation of locking: Gerrit Review:

MB-38106, locked down goxdcr for 6.6.5-MP8.: Gerrit Review:

XDCR Remote Cluster Service Cleanup
