Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-32044

cbdatasource workers retry removed nodes indefinitely

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: master
    • Fix Version/s: Mad-Hatter
    • Component/s: cbdatasource
    • Labels:
      None
    • Triage:
      Untriaged
    • Is this a Regression?:
      No

      Description

      Apologies if this is assigned to the wrong person, I couldn't find a clear owner for cbdatasource in Jira.

      Repro:

      1. Start a 2+ node cluster (any version, I used v5.5.2 EE)
      2. Connect using cbdatasource.NewBucketDataSource()
      3. Hard-failover one node and rebalance to remove it from the cluster
      4. See via options.Logf logging that the vbucket/cluster map has updated properly (remaining node(s) now responsible for 1024 vbuckets)
      5. Also see that the ExponentialBackoffLoop is still indefinitely trying to connect to the removed node

      Detail:

      cbdatasource is falling into the following error block under this scenario, but it doesn't have the same logic to empty the worker channel as was applied in MB-25912

      https://github.com/couchbase/go-couchbase/blob/595f46701cbcad107f3571d9bc8db4cebbc95cf7/cbdatasource/cbdatasource.go#L920-L926

      But there's also a tweak required in emptyWorkerCh to properly terminate the retry loop when the workerCh is closed.

      Proposed fix:

      I've fixed the issue locally using the following patch, but I'm unsure of potential side-effects. Please let me know what you think.

      diff --git a/cbdatasource/cbdatasource.go b/cbdatasource/cbdatasource.go
      index b5620a3..fac8f58 100644
      --- a/cbdatasource/cbdatasource.go
      +++ b/cbdatasource/cbdatasource.go
      @@ -846,18 +846,19 @@ func (d *bucketDataSource) worker(server string, workerCh chan []uint16) int {
       		connect = memcached.Connect
       	}
       
      -	emptyWorkerCh := func() {
      +	emptyWorkerCh := func() int {
       		for {
       			select {
       			case _, ok := <-workerCh:
       				if !ok {
      -					return
      +					return -1 // workerCh was closed
       				}
       
       				// Else, keep looping to consume workerCh.
       
       			default:
      -				return // Stop loop when workerCh is empty.
      +				d.refreshWorkersCh <- "workerCh-emptied"
      +				return 0 // Stop loop when workerCh is empty.
       			}
       		}
       	}
      @@ -872,11 +873,11 @@ func (d *bucketDataSource) worker(server string, workerCh chan []uint16) int {
       		// or failed-over, so consume the workerCh so that the
       		// refresh-cluster goroutine will be unblocked and can receive
       		// our kick.
      -		emptyWorkerCh()
      +		ret := emptyWorkerCh()
       
       		d.Kick("worker-connect-err")
       
      -		return 0
      +		return ret
       	}
       	atomic.AddUint64(&d.stats.TotWorkerConnectOk, 1)
       
      @@ -902,11 +903,11 @@ func (d *bucketDataSource) worker(server string, workerCh chan []uint16) int {
       				// rebalanced out, so consume the workerCh so that the
       				// refresh-cluster goroutine will be unblocked and can
       				// receive our kick.
      -				emptyWorkerCh()
      +				ret := emptyWorkerCh()
       
       				d.Kick("worker-auth-AuthenticateMemcachedConn")
       
      -				return 0
      +				return ret
       			}
       			atomic.AddUint64(&d.stats.TotWorkerAuthenticateMemcachedConnOk, 1)
       		} else if auth, ok := d.auth.(couchbase.AuthWithSaslHandler); ok {
      @@ -922,7 +923,16 @@ func (d *bucketDataSource) worker(server string, workerCh chan []uint16) int {
       				atomic.AddUint64(&d.stats.TotWorkerAuthErr, 1)
       				d.receiver.OnError(fmt.Errorf("worker auth, server: %s,"+
       					" user: %s, err: %v", server, user, err))
      -				return 0
      +
      +				// If we can't authenticate, then maybe a node was
      +				// rebalanced out, so consume the workerCh so that the
      +				// refresh-cluster goroutine will be unblocked and can
      +				// receive our kick.
      +				ret := emptyWorkerCh()
      +
      +				d.Kick("worker-auth-client")
      +
      +				return ret
       			}
       
       			if res.Status != gomemcached.SUCCESS {
      

        Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

          Hide
          steve Steve Yen added a comment -

          Hi Ben Brooks – thanks! This looks legit to me. cc'ing @sreekanth, too

          Also, cbdatasource is under gerrit review system control. Might I ask you to push up this as a change-review on review.couchbase.org so you can become immortal (or at least, become part of the change history of go-couchbase)? http://review.couchbase.org/#/q/project:go-couchbase

          thanks!
          – steve

          Show
          steve Steve Yen added a comment - Hi Ben Brooks – thanks! This looks legit to me. cc'ing @sreekanth, too Also, cbdatasource is under gerrit review system control. Might I ask you to push up this as a change-review on review.couchbase.org so you can become immortal (or at least, become part of the change history of go-couchbase)? http://review.couchbase.org/#/q/project:go-couchbase thanks! – steve
          Hide
          ben.brooks Ben Brooks added a comment -

          Hey Steve Yen,

          I've just got that patch pushed up at http://review.couchbase.org/c/101921/ and assigned yourself and Sreekanth Sivasankaran as reviewers

          Thanks,
          Ben

          Show
          ben.brooks Ben Brooks added a comment - Hey Steve Yen , I've just got that patch pushed up at http://review.couchbase.org/c/101921/ and assigned yourself and Sreekanth Sivasankaran as reviewers Thanks, Ben
          Hide
          build-team Couchbase Build Team added a comment -

          Build couchbase-server-6.5.0-1788 contains go-couchbase commit 4fab5d1 with commit message:
          MB-32044 Stop cbdatasource worker retry loop for removed nodes

          Show
          build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-1788 contains go-couchbase commit 4fab5d1 with commit message: MB-32044 Stop cbdatasource worker retry loop for removed nodes
          Hide
          build-team Couchbase Build Team added a comment -

          Build sync_gateway-2.5.0-134 contains go-couchbase commit 4fab5d1 with commit message:
          MB-32044 Stop cbdatasource worker retry loop for removed nodes

          Show
          build-team Couchbase Build Team added a comment - Build sync_gateway-2.5.0-134 contains go-couchbase commit 4fab5d1 with commit message: MB-32044 Stop cbdatasource worker retry loop for removed nodes
          Hide
          build-team Couchbase Build Team added a comment -

          Build sync_gateway-2.5.0-134 contains sync_gateway commit a1108a8 with commit message:
          Bump manifest for go-couchbase to uptake MB-32044 (#3855)

          Show
          build-team Couchbase Build Team added a comment - Build sync_gateway-2.5.0-134 contains sync_gateway commit a1108a8 with commit message: Bump manifest for go-couchbase to uptake MB-32044 (#3855)
          Hide
          steve Steve Yen added a comment -

          Giving old tickets a scrub and ran into this – marking it resolved. Thanks Ben for the fix!!

          Show
          steve Steve Yen added a comment - Giving old tickets a scrub and ran into this – marking it resolved. Thanks Ben for the fix!!

            People

            • Assignee:
              ben.brooks Ben Brooks
              Reporter:
              ben.brooks Ben Brooks
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Gerrit Reviews

                There are no open Gerrit changes

                  PagerDuty

                  Error rendering 'com.pagerduty.jira-server-plugin:PagerDuty'. Please contact your Jira administrators.