Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-58696

rollback stuck waiting for slice readers for long time

    XMLWordPrintable

Details

    • Untriaged
    • 0
    • Unknown

    Description

      In a customer case, it is seen that indexer seems to performing rollback. As a part of rollback, we wait for all the slice's readers (read from/write to slice.readers channel). In the customer case, indexer seems to be stuck here and hence the rollback appears to be stuck. This causes indexer's main loop to be stuck causing other requests like drop index to timeout. This MB is to investigate what caused the index to stuck there.

       
      Attaching snippets of pprof which show the same:

      1 @ 0x43f516 0x40d6ac 0x40d118 0xdca865 0xc07023 0xc06932 0xbfb4db 0xbfacc5 0x46f141
       
      # 0xdca864 github.com/couchbase/indexing/secondary/indexer.(*metaNotifier).OnIndexDelete+0x1c4 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/cluster_manager_agent.go:825 -----> wait for indexer ch to process msg
       
      # 0xc07022 github.com/couchbase/indexing/secondary/manager.(*LifecycleMgr).DeleteIndex+0x6a2 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/manager/lifecycle.go:2243
       
      # 0xc06931 github.com/couchbase/indexing/secondary/manager.(*LifecycleMgr).handleDeleteIndex+0xd1 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/manager/lifecycle.go:2175
       
      # 0xbfb4da github.com/couchbase/indexing/secondary/manager.(*LifecycleMgr).dispatchRequest+0x2fa /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/manager/lifecycle.go:491
       
      # 0xbfacc4 github.com/couchbase/indexing/secondary/manager.(*LifecycleMgr).processRequest+0x324 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/manager/lifecycle.go:416
       
      ------------------------------------------------------
       
      1 @ 0x43f516 0x40c805 0x40c3bd 0xe080b8 0xe05e0e 0xe00dac 0xf8ebae 0x43f152 0x46f141
       
      # 0xe080b7 github.com/couchbase/indexing/secondary/indexer.(*indexer).handleWorkerMsgs+0x14d7 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/indexer.go:1373 ---> indexer main loop - rollback msg to storage mgr
       
      # 0xe05e0d github.com/couchbase/indexing/secondary/indexer.(*indexer).run+0x36d /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/indexer.go:1080
       
      # 0xe00dab github.com/couchbase/indexing/secondary/indexer.NewIndexer+0x262b /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/indexer.go:511
       
      # 0xf8ebad main.main+0xfad goproj/src/github.com/couchbase/indexing/secondary/cmd/indexer/main.go:182
       
      # 0x43f151 runtime.main+0x211 /home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/runtime/proc.go:250
       
      --------------------------------------------------
       
      1 @ 0x43f516 0x40d6ac 0x40d0d8 0xe97490 0xf3aa43 0xf3a025 0xf39457 0xf33ec8 0xf33b99 0x46f141
       
      # 0xe9748f github.com/couchbase/indexing/secondary/indexer.(*plasmaSlice).Rollback+0xef /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/plasma_slice.go:1887 -----> slice rollback waiting for readers
       
      # 0xf3aa42 github.com/couchbase/indexing/secondary/indexer.(*storageMgr).rollbackToSnapshot+0x82 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/storage_manager.go:977
       
      # 0xf3a024 github.com/couchbase/indexing/secondary/indexer.(*storageMgr).rollbackIndex+0x1a4 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/storage_manager.go:892
       
      # 0xf39456 github.com/couchbase/indexing/secondary/indexer.(*storageMgr).handleRollback+0x476 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/storage_manager.go:802
       
      # 0xf33ec7 github.com/couchbase/indexing/secondary/indexer.(*storageMgr).handleSupvervisorCommands+0x227 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/storage_manager.go:201
       
      # 0xf33b98 github.com/couchbase/indexing/secondary/indexer.(*storageMgr).run+0x38 /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/storage_manager.go:183
      ----------------------------------------------------
      

      this rollback is waiting for slice readers which could be busy for scan requests to complete but ideally scan requests should not take more than 2 mins to timeout. is it possible that while the end user's scan has timed out, we are in the middle of plasma's seek/read and aren't able to come out of it? looking at the pprof again:

      4 @ 0x450237 0x45022b 0x46b0f9 0x4852c5 0xd1ce95 0xd1ce71 0xd1ce70 0xcedbf8 0xceeaa5 0xea30dd 0xea2ea5 0xf07822 0xf072a5 0xf06045 0xeed32c 0xdadcd8 0x46f141
      #	0x46b0f8	sync.runtime_Semrelease+0x18								/home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/runtime/sema.go:66
      #	0x4852c4	sync.(*Mutex).unlockSlow+0x64								/home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/sync/mutex.go:248
      #	0xd1ce94	sync.(*Mutex).Unlock+0xd4								/home/couchbase/.cbdepscache/exploded/x86_64/go-1.18.7/go/src/sync/mutex.go:214
      #	0xd1ce70	github.com/couchbase/plasma.(*Plasma).releaseClockHandle+0xb0				/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/plasma/swapper.go:188
      #	0xd1ce6f	github.com/couchbase/plasma.(*Plasma).tryEvictPagesOnce+0xaf				/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/plasma/swapper.go:269
      #	0xcedbf7	github.com/couchbase/plasma.(*wCtx).tryThrottleForMemory+0x117				/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/plasma/plasma.go:1749
      #	0xceeaa4	github.com/couchbase/plasma.(*Reader).NewSnapshotIterator+0xa4				/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/plasma/reader.go:56
      #	0xea30dc	github.com/couchbase/indexing/secondary/indexer.(*plasmaSnapshot).Iterate+0x13c		/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/plasma_slice.go:2966
      #	0xea2ea4	github.com/couchbase/indexing/secondary/indexer.(*plasmaSnapshot).Range+0x64		/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/plasma_slice.go:2939
      #	0xf07821	github.com/couchbase/indexing/secondary/indexer.scanSingleSlice+0x3a1			/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/scan_scatter.go:206
      #	0xf072a4	github.com/couchbase/indexing/secondary/indexer.scanOne+0x124				/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/scan_scatter.go:148
      #	0xf06044	github.com/couchbase/indexing/secondary/indexer.scatter+0x84				/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/scan_scatter.go:55
      #	0xeed32b	github.com/couchbase/indexing/secondary/indexer.(*IndexScanSource).Routine+0xb4b	/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/scan_pipeline.go:339
      #	0xdadcd7	github.com/couchbase/indexing/secondary/pipeline.(*Pipeline).runIt.func1+0x37		/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/pipeline/pipeline.go:75
      

      here we can see an instance of scanSingle trying to evict page but not sure if this can get stuck (for more than 3 hours).
       

      Attachments

        For Gerrit Dashboard: MB-58696
        # Subject Branch Project Status CR V

        Activity

          People

            dhruvil.ketanshah Dhruvil Shah
            dhruvil.ketanshah Dhruvil Shah
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:

              Gerrit Reviews

                There are 2 open Gerrit changes

                PagerDuty