  1. Couchbase Server
  2. MB-55035

XDCR on non-KV node can freeze when replication settings are changed multiple times


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Morpheus
    • 7.1.0, 7.1.1, 7.1.2, 7.1.3
    • XDCR
    • Untriaged
    • 0
    • No

    Description

      We can see a replication spec service callback stuck:

      1 @ 0x43d456 0x40a745 0x40a2fd 0x8cb227 0x8c9971 0x8b9ab3 0x97b6ee 0x97b18d 0x97a45b 0x979ed5 0x979ed6 0xcf801f 0x46cde1
      #       0x8cb226        github.com/couchbase/goxdcr/peerToPeer.(*ReplicatorAgentImpl).SetUpdatedSpecAsync+0x86                  /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/peerToPeer/replicaReplicator.go:456
      #       0x8c9970        github.com/couchbase/goxdcr/peerToPeer.(*ReplicaReplicatorImpl).HandleSpecChange+0x1d0                  /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/peerToPeer/replicaReplicator.go:210
      #       0x8b9ab2        github.com/couchbase/goxdcr/peerToPeer.(*P2PManagerImpl).ReplicationSpecChangeCallback+0x1f2            /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/peerToPeer/peerToPeerManager.go:600
      #       0x97b6ed        github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).executeCallbackWithPriority+0x18d    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1293
      #       0x97b18c        github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).callMetadataChangeCb+0x2ac           /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1264
      #       0x97a45a        github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).updateCacheInternal+0x3ba            /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1161
      #       0x979ed4        github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).updateCache+0x214                    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1125
      #       0x979ed5        github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).ReplicationSpecServiceCallback+0x215 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1088
      #       0xcf801e        github.com/couchbase/goxdcr/replication_manager.(*MetakvChangeListener).metakvCallback_async+0x5e       /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/replication_manager/metakv_change_listener.go:97
      

      It is stuck because it’s trying to send to a reload channel: https://github.com/couchbase/goxdcr/blob/cbefdb7fec3b406d9b507aef842658b598b30032/peerToPeer/replicaReplicator.go#L456

      And this causes the other replication spec callback to be stuck:

      16 @ 0x43d456 0x44ded3 0x44dead 0x468e05 0x483485 0x97a12e 0x97a10a 0x979ed5 0x979ed6 0xcf801f 0x46cde1
      #       0x468e04        sync.runtime_SemacquireMutex+0x24                                                                       /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/runtime/sema.go:71
      #       0x483484        sync.(*Mutex).lockSlow+0x164                                                                            /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/sync/mutex.go:162
      #       0x97a12d        sync.(*Mutex).Lock+0x8d                                                                                 /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/sync/mutex.go:81
      #       0x97a109        github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).updateCacheInternal+0x69             /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1130
      #       0x979ed4        github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).updateCache+0x214                    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1125
      #       0x979ed5        github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).ReplicationSpecServiceCallback+0x215 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1088
      #       0xcf801e        github.com/couchbase/goxdcr/replication_manager.(*MetakvChangeListener).metakvCallback_async+0x5e       /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/replication_manager/metakv_change_listener.go:97
      

      From code inspection, we can see that this means that the agent exited: https://github.com/couchbase/goxdcr/blob/cbefdb7fec3b406d9b507aef842658b598b30032/peerToPeer/replicaReplicator.go#L485-L489

      And then there is nobody to listen to the reload channel: https://github.com/couchbase/goxdcr/blob/cbefdb7fec3b406d9b507aef842658b598b30032/peerToPeer/replicaReplicator.go#L509

      The channel fills up after 10 events, so the replication settings have to be changed up to 10 times before the send blocks.

      To reproduce

      1. Create a 2-node source cluster (one KV node, one Analytics node) and a 1-node target cluster.
      2. Create a replication.
      3. Change the replication settings 10 times. I incremented the XMEM nozzle batch count by one each time.
      4. The 11th change to the replication then causes the UI to freeze.

      The stack trace below shows why it froze:

      1 @ 0x10003ceb6 0x10004d893 0x10004d86d 0x100068d85 0x100083d05 0x1005ad1ee 0x1005ad1ca 0x1005aa9ca 0x1005aa85e 0x100a58904 0x100a3ed6d 0x100a3ab28 0x100a3a345 0x1005b5db1 0x10006d3e1
      #       0x100068d84     sync.runtime_SemacquireMutex+0x24                                                                       /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/runtime/sema.go:71
      #       0x100083d04     sync.(*Mutex).lockSlow+0x164                                                                            /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/sync/mutex.go:162
      #       0x1005ad1ed     sync.(*Mutex).Lock+0x8d                                                                                 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/sync/mutex.go:81
      #       0x1005ad1c9     github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).updateCacheInternal+0x69             /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1164
      #       0x1005aa9c9     github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).setReplicationSpecInternal+0x129     /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:818
      #       0x1005aa85d     github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).SetReplicationSpec+0x1d              /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:798
      #       0x100a58903     github.com/couchbase/goxdcr/replication_manager.UpdateReplicationSettings+0x803                         /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/replication_manager/replication_manager.go:724
      #       0x100a3ed6c     github.com/couchbase/goxdcr/replication_manager.(*Adminport).doChangeReplicationSettingsRequest+0x5cc   /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/replication_manager/adminport.go:737
      #       0x100a3ab27     github.com/couchbase/goxdcr/replication_manager.(*Adminport).handleRequest+0x727                        /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/replication_manager/adminport.go:217
      #       0x100a3a344     github.com/couchbase/goxdcr/replication_manager.(*Adminport).processRequest+0x64                        /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/replication_manager/adminport.go:160
      #       0x1005b5db0     github.com/couchbase/goxdcr/gen_server.(*GenServer).run+0x350                                           /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/gen_server/gen_server.go:103
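
Steps 3 and 4 of the reproduction could be scripted instead of clicked through the UI. The sketch below is an assumption-heavy illustration, not something taken from this ticket: it assumes the replication settings REST endpoint `POST /settings/replications/<replication id>` on port 8091 and the `workerBatchSize` parameter for the batch count, and the address, credentials, and replication ID are placeholders. The ticket does not say which node the requests were sent to, so the target node is also left as a placeholder.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strconv"
	"strings"
	"time"
)

// buildRequest prepares one settings change for a replication.
// ASSUMPTION: the endpoint path and the workerBatchSize parameter name
// follow the public XDCR REST API; verify them against your server version.
func buildRequest(baseURL, user, pass, replID string, batchCount int) (*http.Request, error) {
	// The replication ID contains slashes, so it must be path-escaped.
	endpoint := baseURL + "/settings/replications/" + url.PathEscape(replID)
	form := url.Values{"workerBatchSize": {strconv.Itoa(batchCount)}}
	req, err := http.NewRequest(http.MethodPost, endpoint, strings.NewReader(form.Encode()))
	if err != nil {
		return nil, err
	}
	req.SetBasicAuth(user, pass)
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	return req, nil
}

func main() {
	// Placeholders: point these at the source cluster under test.
	const (
		baseURL = "http://127.0.0.1:8091"
		user    = "Administrator"
		pass    = "password"
		replID  = "<remote-cluster-uuid>/<source-bucket>/<target-bucket>"
	)
	// The timeout makes a frozen request show up as an error instead of a hang.
	client := &http.Client{Timeout: 15 * time.Second}

	// Ten changes fill the reload channel; the 11th is expected to hang.
	for i := 1; i <= 11; i++ {
		req, err := buildRequest(baseURL, user, pass, replID, 500+i)
		if err != nil {
			fmt.Println("building request:", err)
			return
		}
		resp, err := client.Do(req)
		if err != nil {
			fmt.Printf("change %d failed or timed out: %v\n", i, err)
			return
		}
		resp.Body.Close()
		fmt.Printf("change %d: %s\n", i, resp.Status)
	}
}
```

The batch count values 501 through 511 only imitate the "one by one" changes from step 3; any sequence of distinct setting changes should behave the same way under the assumptions above.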
      

      To reproduce the original stack trace, create a new replication from the KV node.
      The Analytics node then shows the following:

      1 @ 0x10003ceb6 0x10004d893 0x10004d86d 0x100068d85 0x100083d05 0x1005ad1ee 0x1005ad1ca 0x1005acf95 0x1005acf96 0x100a43a1f 0x10006d3e1
      #       0x100068d84     sync.runtime_SemacquireMutex+0x24                                                                       /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/runtime/sema.go:71
      #       0x100083d04     sync.(*Mutex).lockSlow+0x164                                                                            /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/sync/mutex.go:162
      #       0x1005ad1ed     sync.(*Mutex).Lock+0x8d                                                                                 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/sync/mutex.go:81
      #       0x1005ad1c9     github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).updateCacheInternal+0x69             /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1164
      #       0x1005acf94     github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).updateCache+0x214                    /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1159
      #       0x1005acf95     github.com/couchbase/goxdcr/metadata_svc.(*ReplicationSpecService).ReplicationSpecServiceCallback+0x215 /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/metadata_svc/replication_spec_service.go:1122
      #       0x100a43a1e     github.com/couchbase/goxdcr/replication_manager.(*MetakvChangeListener).metakvCallback_async+0x5e       /Users/neil.huang/source/couchbase/goproj/src/github.com/couchbase/goxdcr/replication_manager/metakv_change_listener.go:97
      


            People

              ritam.sharma Ritam Sharma
              neil.huang Neil Huang