  Couchbase Server / MB-59416

XDCR - Xmem nozzle cleanup is stuck due to waiting on non-existent bandwidth throttler

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: 7.6.0
    • Affects Version/s: 7.6.0, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.1.4, 7.0.5, 7.1.0, 7.1.1, 7.1.2, 7.2.0, 7.1.3, 7.2.1, 7.1.5, 7.2.4, 7.2.2, 7.2.3
    • Component/s: XDCR
    • Triaged
    • 0
    • Unknown

    Description

      When a pipeline/replication is configured with a bandwidth limit and the pipeline stops, the Xmem nozzles perform a cleanup. This cleanup gets stuck because the writers (i.e. the Xmem nozzle goroutines writing to the socket) wait for the bandwidth throttler (referred to simply as "the throttler" henceforth) to release some capacity/quota.

      However, because the pipeline is stopping, the throttler goroutine also exits. We therefore end up with the writers waiting on a non-existent throttler, and the cleanup, which waits for the writers, never completes.
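      The hang described above can be reproduced in miniature. The following is a minimal sketch, not goxdcr code: toyThrottler, writersFinished and demoTimeout are hypothetical names, and the throttler is assumed to block writers on a sync.Cond until some other goroutine adds quota and signals. With no such goroutine running, the writers never return and a WaitGroup-based cleanup cannot finish.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// demoTimeout is how long the demo waits before declaring the writers stuck.
const demoTimeout = 200 * time.Millisecond

// toyThrottler is a hypothetical stand-in for a bandwidth throttler: writers
// block on a condition variable until a refill goroutine adds quota.
type toyThrottler struct {
	mu    sync.Mutex
	cond  *sync.Cond
	quota int64
}

func newToyThrottler() *toyThrottler {
	t := &toyThrottler{}
	t.cond = sync.NewCond(&t.mu)
	return t
}

// Wait blocks until quota is positive. Nothing here notices that the refill
// goroutine has gone away, which is the problem being illustrated.
func (t *toyThrottler) Wait(n int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for t.quota <= 0 {
		t.cond.Wait()
	}
	t.quota -= n
}

// writersFinished starts the given number of writer goroutines against a
// throttler that has no quota and no refill goroutine, then reports whether
// all of them returned within the timeout. The inner wg.Wait() plays the role
// of a cleanup routine waiting for its writers.
func writersFinished(writers int, timeout time.Duration) bool {
	t := newToyThrottler()
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			t.Wait(1)
		}()
	}
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	// Prints "writers finished: false": the writers stay blocked forever.
	fmt.Println("writers finished:", writersFinished(4, demoTimeout))
}
```

      The timeout exists only so the demo terminates; in the scenario reported here there is no timeout, so the cleanup waits indefinitely.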

       

      The stack trace for the Xmem nozzle cleanup:

      goroutine profile: total 187736554 @ 0x43d376 0x44ddd3 0x44ddad 0x468d25 0x484f52 0x909e5f 0x46ce21
      #       0x468d24        sync.runtime_Semacquire+0x24                                            /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.5/go/src/runtime/sema.go:56
      #       0x484f51        sync.(*WaitGroup).Wait+0x51                                             /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.5/go/src/sync/waitgroup.go:136
      #       0x909e5e        github.com/couchbase/goxdcr/parts.(*XmemNozzle).finalCleanup+0x3e       /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:1128

      The following stack trace shows the writers waiting on the bandwidth throttler:

      5067 @ 0x43d376 0x46901d 0x468ffd 0x48180c 0x9a7d11 0x91d3c4 0x914b8c 0x91b42b 0x904674 0x91aeaf 0x46ce21
      #       0x468ffc        sync.runtime_notifyListWait+0x11c                                                       /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.5/go/src/runtime/sema.go:513
      #       0x48180b        sync.(*Cond).Wait+0x8b                                                                  /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.5/go/src/sync/cond.go:56
      #       0x9a7d10        github.com/couchbase/goxdcr/pipeline_svc.(*BandwidthThrottler).Wait+0x90                /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/bandwidth_throttler.go:247
      #       0x91d3c3        github.com/couchbase/goxdcr/parts.(*XmemNozzle).writeToClient+0x863                     /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:3363
      #       0x914b8b        github.com/couchbase/goxdcr/parts.(*XmemNozzle).sendSingleSetMeta+0xab                  /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:2321
      #       0x91b42a        github.com/couchbase/goxdcr/parts.(*XmemNozzle).resendIfTimeout+0x46a                   /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:3036
      #       0x904673        github.com/couchbase/goxdcr/parts.(*requestBuffer).modSlot+0x53                         /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:294
      #       0x91aeae        github.com/couchbase/goxdcr/parts.(*XmemNozzle).checkAndRepairBufferMonitor+0x32e       /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:2980
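      A sync.Cond wait like the one in this trace only returns when another goroutine signals it. One general way to make such a wait interruptible is to record a "stopped" flag under the same lock and broadcast on shutdown. The sketch below illustrates that pattern only; it is not the change made for this issue, and stoppableThrottler, Stop, releaseOnStop and errThrottlerStopped are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// errThrottlerStopped is returned to writers released by Stop.
var errThrottlerStopped = errors.New("throttler stopped")

// stoppableThrottler is a hypothetical throttler whose waiters can be
// released on shutdown instead of waiting for quota that will never arrive.
type stoppableThrottler struct {
	mu      sync.Mutex
	cond    *sync.Cond
	quota   int64
	stopped bool
}

func newStoppableThrottler() *stoppableThrottler {
	t := &stoppableThrottler{}
	t.cond = sync.NewCond(&t.mu)
	return t
}

// Wait blocks until quota is available or the throttler has been stopped.
func (t *stoppableThrottler) Wait(n int64) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	for t.quota <= 0 && !t.stopped {
		t.cond.Wait()
	}
	if t.stopped {
		return errThrottlerStopped
	}
	t.quota -= n
	return nil
}

// Stop marks the throttler as stopped and wakes every blocked waiter.
func (t *stoppableThrottler) Stop() {
	t.mu.Lock()
	t.stopped = true
	t.mu.Unlock()
	t.cond.Broadcast()
}

// releaseOnStop starts the given number of writers against a throttler with
// no quota, stops the throttler, and returns how many writers came back with
// errThrottlerStopped.
func releaseOnStop(writers int) int {
	t := newStoppableThrottler()
	var wg sync.WaitGroup
	var released int64
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := t.Wait(1); errors.Is(err, errThrottlerStopped) {
				atomic.AddInt64(&released, 1)
			}
		}()
	}
	t.Stop()
	wg.Wait() // returns because every writer is released by Stop
	return int(atomic.LoadInt64(&released))
}

func main() {
	fmt.Println("released writers:", releaseOnStop(4)) // released writers: 4
}
```

      Because the flag is checked in the same loop condition as the quota, a writer that calls Wait after Stop returns immediately rather than blocking.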

       

      Steps to reproduce:

      1. Create a replication with a bandwidth usage limit.
      2. Ensure that the usage limit is low enough that the writers block all the time. A log line like the following indicates this situation:
         2023-11-02T10:29:30.008Z WARN GOXDCR.BwThrottler: pipelineFullTopic=13e32dab9bdeaa83cb90cbeda32d74bf/B1/B1, 13e32dab9bdeaa83cb90cbeda32d74bf/B1/B1_BandwidthThrottlerSvc went over the limit. Need cool down before more mutations can be sent. bandwidth_limit=1048576, bandwidth_usage_quota=-206079
      3. Pause the replication.
      4. Check the goroutine stack trace.
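      The traces in this report are in the grouped text format that Go's runtime/pprof package emits for the goroutine profile at debug level 1. As a minimal sketch, a Go process you control can produce the same format as shown below; goroutineDump is a hypothetical helper name, and how the profile is collected from a running goxdcr process is not stated in this issue.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"strings"
)

// goroutineDump returns the goroutine profile of the current process in the
// debug=1 text format, which groups identical stacks and prefixes each group
// with the number of goroutines sharing it.
func goroutineDump() (string, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	dump, err := goroutineDump()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The first line has the form "goroutine profile: total N".
	fmt.Println(strings.SplitN(dump, "\n", 2)[0])
}
```

      In such a dump, a large group whose stack ends in sync.(*Cond).Wait called from a throttler's Wait method, alongside a cleanup routine parked in sync.(*WaitGroup).Wait, is the signature to look for.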

       

            People

              ayush.nayyar Ayush Nayyar
              sudeep.jathar Sudeep Jathar