  1. Couchbase Server
  2. MB-60006

[BP 7.2.4] - XDCR - CheckpointsServiceCacheImpl.InvalidateCache might block when its Run function has exited

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 7.2.4
    • 7.1.4, 7.1.0, 7.1.1, 7.1.2, 7.2.0, 7.1.3, 7.2.1, 7.1.5, 7.2.2
    • XDCR
    • Untriaged
    • 0
    • Unknown

    Description

      While a pipeline is stopping, execution can block waiting for DelCheckpointsDocs to finish:

      1 @ 0x43d456 0x44ded3 0x44dead 0x468ce5 0x484f12 0x9364c5 0x9f7c94 0x9ff18f 0x483822 0x9fe307 0x9fe2d5 0x46cde1
      #       0x468ce4        sync.runtime_Semacquire+0x24                                                            /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/runtime/sema.go:56
      #       0x484f11        sync.(*WaitGroup).Wait+0x51                                                             /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/sync/waitgroup.go:136
      #       0x9364c4        github.com/couchbase/goxdcr/metadata_svc.(*CheckpointsService).DelCheckpointsDocs+0x384 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/checkpoints_service.go:209
      #       0x9f7c93        github.com/couchbase/goxdcr/pipeline_manager.(*PipelineManager).StopPipeline+0x7f3      /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:648
      #       0x9ff18e        github.com/couchbase/goxdcr/pipeline_manager.(*PipelineUpdater).run.func1+0xe4e         /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1538
      #       0x483821        sync.(*Once).doSlow+0xc1                                                                /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/sync/once.go:68
      #       0x9fe306        sync.(*Once).Do+0x46                                                                    /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.1/go/src/sync/once.go:59
      #       0x9fe2d4        github.com/couchbase/goxdcr/pipeline_manager.(*PipelineUpdater).run+0x14                /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/pipeline_manager/pipeline_manager.go:1523 

      The execution is actually stuck in InvalidateCache(), which is called as part of DelCheckpointsDocs:

      1 @ 0x43d456 0x40b5ec 0x40b018 0x93472b 0x936b33 0x46cde1
      #       0x93472a        github.com/couchbase/goxdcr/metadata_svc.(*CheckpointsServiceCacheImpl).InvalidateCache+0x8a    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/checkpointsServiceCache.go:181
      #       0x936b32        github.com/couchbase/goxdcr/metadata_svc.(*CheckpointsService).DelCheckpointsDocs.func1+0x72    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/metadata_svc/checkpoints_service.go:179 

      Looking at the code of InvalidateCache:

      177 func (c *CheckpointsServiceCacheImpl) InvalidateCache() {
      178 	req := NewInvalidateReq()
      179 	select {
      180 	case c.externalInvalidateCh <- req:
      181 		<-req.invalidatedAck
      182 		return
      183 	}
      184 }

      externalInvalidateCh is a buffered channel of size 10, so the request is enqueued without blocking. Execution is then stuck at line 181, waiting to receive from invalidatedAck. Because invalidatedAck is an unbuffered channel, the receive blocks until something is sent on the channel or the channel is closed.

      Looking for places where invalidatedAck is written to or closed:

      216 func (c *CheckpointsServiceCacheImpl) Run() {
      217 	for {
      218 		select {
      219 		case <-c.finCh:
      220 			return
                      ...
      317 		case oneInvalidateReq := <-c.externalInvalidateCh:
      318 			c.requestInvalidateCache()
      319 			close(oneInvalidateReq.invalidatedAck)
                      ...
      323 		}
      324 	}
      325 } 

      Line 319 should unblock InvalidateCache. However, c.finCh on line 219 can be signaled first, which makes Run() return. This is triggered via CheckpointsService.ReplicationSpecChangeCallback, one of the many metadataChangeCallbacks that are executed concurrently.

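      In summary, CheckpointsService.ReplicationSpecChangeCallback might race ahead of ReplicationSpecChangeListener.replicationSpecChangeHandlerCallback, which can cause InvalidateCache() to be executed after Run() has already exited. At that point nothing reads from externalInvalidateCh, so invalidatedAck is never closed and the caller blocks indefinitely.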
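
      One common way to guard against this kind of hang is to make the waiting side also watch the shutdown channel. The sketch below illustrates that pattern only; it is an assumption about a possible approach and is not necessarily the change that resolved this issue. guardedCache, ackReq, newGuardedCache and Stop are hypothetical names, and the bool return value is added purely for illustration.

```go
package main

import "fmt"

// ackReq is a hypothetical request type carrying an acknowledgement channel.
type ackReq struct {
	invalidatedAck chan struct{}
}

// guardedCache is a hypothetical, stripped-down cache with a Run loop.
type guardedCache struct {
	externalInvalidateCh chan *ackReq
	finCh                chan struct{}
}

func newGuardedCache() *guardedCache {
	return &guardedCache{
		externalInvalidateCh: make(chan *ackReq, 10),
		finCh:                make(chan struct{}),
	}
}

// Run serves invalidate requests until finCh is closed.
func (c *guardedCache) Run() {
	for {
		select {
		case <-c.finCh:
			return
		case req := <-c.externalInvalidateCh:
			close(req.invalidatedAck)
		}
	}
}

// Stop signals Run to exit by closing finCh.
func (c *guardedCache) Stop() { close(c.finCh) }

// InvalidateCache returns true if the request was acknowledged and false if
// the cache was stopped first. Both the send and the wait also select on
// finCh, so a caller can never block once Run has been told to exit.
func (c *guardedCache) InvalidateCache() bool {
	req := &ackReq{invalidatedAck: make(chan struct{})}
	select {
	case c.externalInvalidateCh <- req:
	case <-c.finCh:
		return false
	}
	select {
	case <-req.invalidatedAck:
		return true
	case <-c.finCh:
		return false
	}
}

func main() {
	c := newGuardedCache()
	done := make(chan struct{})
	go func() {
		c.Run()
		close(done)
	}()

	fmt.Println("invalidate while running:", c.InvalidateCache())

	c.Stop()
	<-done // wait until Run has really exited
	fmt.Println("invalidate after stop:", c.InvalidateCache())
}
```

      The second call returns instead of hanging because a closed finCh is always ready to receive from, whichever branch of the first select is taken.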

            People

              ayush.nayyar Ayush Nayyar
              sumukh.bhat Sumukh Bhat