Description
A replication spec service callback is stuck:
1 @ 0x43d456 0x40a745 0x40a2fd 0x8cb227 0x8c9971 0x8b9ab3 0x97b6ee 0x97b18d 0x97a45b 0x979ed5 0x979ed6 0xcf801f 0x46cde1
|
# 0x8cb226 github.com/couchbase/goxdcr/peerToPeer.(*ReplicatorAgentImpl).SetUpdatedSpecAsync+0x86 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/peerToPeer/replicaReplicator.go:456
|
# 0x8c9970 github.com/couchbase/goxdcr/peerToPeer.(*ReplicaReplicatorImpl).HandleSpecChange+0x1d0 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/peerToPeer/replicaReplicator.go:210
|
# 0x8b9ab2 github.com/couchbase/goxdcr/peerToPeer.(*P2PManagerImpl).ReplicationSpecChangeCallback+0x1f2 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/peerToPeer/peerToPeerManager.go:600
|
# 0x97b6ed github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).executeCallbackWithPriority+0x18d /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1293
|
# 0x97b18c github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).callMetadataChangeCb+0x2ac /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1264
|
# 0x97a45a github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).updateCacheInternal+0x3ba /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1161
|
# 0x979ed4 github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).updateCache+0x214 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1125
|
# 0x979ed5 github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).ReplicationSpecServiceCallback+0x215 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1088
|
# 0xcf801e github.com/couchbase/goxdcr/replication_manager.(*MetakvChangeListener).metakvCallback_async+0x5e /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/replication_manager/metakv_change_listener.go:97
|
It is stuck because it’s trying to send to a reload channel: https://github.com/couchbase/goxdcr/blob/cbefdb7fec3b406d9b507aef842658b598b30032/peerToPeer/replicaReplicator.go#L456
This in turn causes the other replication spec callbacks to be stuck, waiting on the mutex in updateCacheInternal:
16 @ 0x43d456 0x44ded3 0x44dead 0x468e05 0x483485 0x97a12e 0x97a10a 0x979ed5 0x979ed6 0xcf801f 0x46cde1
|
# 0x468e04 sync.runtime_SemacquireMutex+0x24 /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/runtime/sema.go:71
|
# 0x483484 sync.(*Mutex).lockSlow+0x164 /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/sync/mutex.go:162
|
# 0x97a12d sync.(*Mutex).Lock+0x8d /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/sync/mutex.go:81
|
# 0x97a109 github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).updateCacheInternal+0x69 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1130
|
# 0x979ed4 github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).updateCache+0x214 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1125
|
# 0x979ed5 github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).ReplicationSpecServiceCallback+0x215 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1088
|
# 0xcf801e github.com/couchbase/goxdcr/replication_manager.(*MetakvChangeListener).metakvCallback_async+0x5e /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/replication_manager/metakv_change_listener.go:97
|
From code inspection, this means that the agent has exited: https://github.com/couchbase/goxdcr/blob/cbefdb7fec3b406d9b507aef842658b598b30032/peerToPeer/replicaReplicator.go#L485-L489
Once it has exited, nothing listens on the reload channel anymore: https://github.com/couchbase/goxdcr/blob/cbefdb7fec3b406d9b507aef842658b598b30032/peerToPeer/replicaReplicator.go#L509
The channel fills up after 10 events, so it can take up to 10 replication setting changes before a send blocks.
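The blocking behaviour can be illustrated with a minimal, standalone sketch. This is not the goxdcr code: the `agent` type, its `reloadCh` and `finCh` fields, and the function names are hypothetical stand-ins, and the only detail taken from the ticket is the buffer capacity of 10. `acceptedSends` shows that a buffered channel with no receiver accepts exactly as many sends as its capacity. `setUpdatedSpecAsync` shows one possible way to avoid the hang, by also selecting on a channel that is closed when the agent exits.

```go
package main

import (
	"errors"
	"fmt"
)

// reloadChCapacity mirrors the 10-event capacity described in the ticket.
const reloadChCapacity = 10

var errAgentStopped = errors.New("agent stopped")

// acceptedSends attempts non-blocking sends on a buffered channel that has
// no receiver and returns how many were accepted. Any further plain send
// (ch <- v) would block forever.
func acceptedSends(capacity, attempts int) int {
	ch := make(chan int, capacity)
	accepted := 0
	for i := 0; i < attempts; i++ {
		select {
		case ch <- i:
			accepted++
		default:
			// Buffer is full and nobody is receiving.
		}
	}
	return accepted
}

// agent is a hypothetical stand-in for the replicator agent.
type agent struct {
	reloadCh chan string   // buffered, normally drained by the run loop
	finCh    chan struct{} // closed when the run loop exits
}

func newAgent() *agent {
	return &agent{
		reloadCh: make(chan string, reloadChCapacity),
		finCh:    make(chan struct{}),
	}
}

// stop simulates the run loop exiting.
func (a *agent) stop() { close(a.finCh) }

// setUpdatedSpecAsync queues a spec update but returns an error instead of
// blocking if the agent has already exited or exits while the send waits.
func (a *agent) setUpdatedSpecAsync(spec string) error {
	// Check for exit first so a stopped agent never accepts new work.
	select {
	case <-a.finCh:
		return errAgentStopped
	default:
	}
	select {
	case a.reloadCh <- spec:
		return nil
	case <-a.finCh:
		return errAgentStopped
	}
}

func main() {
	fmt.Println("accepted sends:", acceptedSends(reloadChCapacity, 11))

	a := newAgent()
	a.stop()
	fmt.Println("send after stop:", a.setUpdatedSpecAsync("spec"))
}
```

With a guard of this kind, a caller that holds a lock while sending would get an error back instead of holding the lock indefinitely. Whether this is the right fix for goxdcr depends on how the agent lifecycle is meant to work.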
To reproduce
- Create a 2-node source cluster (one KV node, one Analytics node) and a 1-node target cluster.
- Create a replication.
- Change the replication setting 10 times. For me, I changed the XMEM nozzle batch size count one by one.
- Changing the replication setting an 11th time will then cause the UI to freeze.
The stack trace below shows why it froze:
1 @ 0x10003ceb6 0x10004d893 0x10004d86d 0x100068d85 0x100083d05 0x1005ad1ee 0x1005ad1ca 0x1005aa9ca 0x1005aa85e 0x100a58904 0x100a3ed6d 0x100a3ab28 0x100a3a345 0x1005b5db1 0x10006d3e1
|
# 0x100068d84 sync.runtime_SemacquireMutex+0x24 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/runtime/sema.go:71
|
# 0x100083d04 sync.(*Mutex).lockSlow+0x164 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/sync/mutex.go:162
|
# 0x1005ad1ed sync.(*Mutex).Lock+0x8d /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/sync/mutex.go:81
|
# 0x1005ad1c9 github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).updateCacheInternal+0x69 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1164
|
# 0x1005aa9c9 github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).setReplicationSpecInternal+0x129 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:818
|
# 0x1005aa85d github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).SetReplicationSpec+0x1d /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:798
|
# 0x100a58903 github.com/couchbase/goxdcr/replication_manager.UpdateReplicationSettings+0x803 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/replication_manager/replication_manager.go:724
|
# 0x100a3ed6c github.com/couchbase/goxdcr/replication_manager.(*Adminport).doChangeReplicationSettingsRequest+0x5cc /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/replication_manager/adminport.go:737
|
# 0x100a3ab27 github.com/couchbase/goxdcr/replication_manager.(*Adminport).handleRequest+0x727 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/replication_manager/adminport.go:217
|
# 0x100a3a344 github.com/couchbase/goxdcr/replication_manager.(*Adminport).processRequest+0x64 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/replication_manager/adminport.go:160
|
# 0x1005b5db0 github.com/couchbase/goxdcr/gen_server.(*GenServer).run+0x350 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/gen_server/gen_server.go:103
|
To reproduce the original stack trace, create a new replication from the KV node.
On the Analytics node, we will then see the following:
1 @ 0x10003ceb6 0x10004d893 0x10004d86d 0x100068d85 0x100083d05 0x1005ad1ee 0x1005ad1ca 0x1005acf95 0x1005acf96 0x100a43a1f 0x10006d3e1
|
# 0x100068d84 sync.runtime_SemacquireMutex+0x24 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/runtime/sema.go:71
|
# 0x100083d04 sync.(*Mutex).lockSlow+0x164 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/sync/mutex.go:162
|
# 0x1005ad1ed sync.(*Mutex).Lock+0x8d /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/sync/mutex.go:81
|
# 0x1005ad1c9 github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).updateCacheInternal+0x69 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1164
|
# 0x1005acf94 github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).updateCache+0x214 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1159
|
# 0x1005acf95 github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).ReplicationSpecServiceCallback+0x215 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1122
|
# 0x100a43a1e github.com/couchbase/goxdcr/replication_manager.(*MetakvChangeListener).metakvCallback_async+0x5e /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/replication_manager/metakv_change_listener.go:97
|