Couchbase Server / MB-32383

Eventing: Rebalance hangs if all eventing nodes are removed and added back to cluster


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: 6.0.1
    • Affects Version/s: 5.5.1
    • Component/s: eventing
    • Labels: None
    • Triage: Untriaged
    • Is this a Regression?: Unknown

    Description

      We have observed in the wild that if the sole eventing node is removed from the cluster, whether by rebalance or by failover, and the node is then added back to the cluster, the subsequent rebalance hangs.

      We are able to consistently reproduce the issue with the following steps:

      1. Create a cluster with at least one Data Service node, and one Eventing Service node
      2. Deploy any function on a bucket
      3. Remove or failover the eventing node, and rebalance
      4. Add the eventing node back to the cluster, rebalance

      After this, the rebalance hangs.
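
      As a convenience, the following is a minimal sketch of steps 3 and 4 driven through the cluster management REST API on port 8091, assuming the standard /controller/failOver, /controller/addNode and /controller/rebalance endpoints; the hostnames, otpNode names and Administrator/password credentials are placeholders. The same sequence, with every eventing node failed over before the first rebalance, also reproduces the multi-node variant described below.

      // repro.go: sketch of steps 3-4 -- hard failover the eventing node,
      // rebalance it out, add it back with the eventing service, rebalance again.
      package main

      import (
      	"fmt"
      	"io"
      	"net/http"
      	"net/url"
      	"strings"
      	"time"
      )

      const (
      	clusterURL   = "http://10.112.181.101:8091" // any data node (placeholder)
      	user         = "Administrator"              // placeholder credentials
      	pass         = "password"
      	dataOtp      = "ns_1@10.112.181.101" // otpNode names (placeholders)
      	eventingOtp  = "ns_1@10.112.181.102"
      	eventingHost = "10.112.181.102" // eventing node hostname (placeholder)
      )

      // post sends a form-encoded POST to an ns_server endpoint with basic auth.
      func post(path string, form url.Values) {
      	req, _ := http.NewRequest("POST", clusterURL+path, strings.NewReader(form.Encode()))
      	req.SetBasicAuth(user, pass)
      	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
      	resp, err := http.DefaultClient.Do(req)
      	if err != nil {
      		fmt.Println(path, "->", err)
      		return
      	}
      	defer resp.Body.Close()
      	fmt.Println(path, "->", resp.Status)
      }

      // waitForRebalance polls /pools/default/rebalanceProgress until the cluster
      // reports that no rebalance is running.
      func waitForRebalance() {
      	for {
      		req, _ := http.NewRequest("GET", clusterURL+"/pools/default/rebalanceProgress", nil)
      		req.SetBasicAuth(user, pass)
      		resp, err := http.DefaultClient.Do(req)
      		if err != nil {
      			return
      		}
      		body, _ := io.ReadAll(resp.Body)
      		resp.Body.Close()
      		if strings.Contains(string(body), `"none"`) {
      			return
      		}
      		time.Sleep(5 * time.Second)
      	}
      }

      func main() {
      	// Step 3: hard failover the sole eventing node, then rebalance it out.
      	post("/controller/failOver", url.Values{"otpNode": {eventingOtp}})
      	post("/controller/rebalance", url.Values{
      		"knownNodes":   {dataOtp + "," + eventingOtp},
      		"ejectedNodes": {eventingOtp},
      	})
      	waitForRebalance()

      	// Step 4: add the eventing node back with the eventing service and
      	// rebalance again -- this is the rebalance that hangs, so the final
      	// waitForRebalance() never returns on an affected cluster.
      	post("/controller/addNode", url.Values{
      		"hostname": {eventingHost},
      		"user":     {user},
      		"password": {pass},
      		"services": {"eventing"},
      	})
      	post("/controller/rebalance", url.Values{
      		"knownNodes":   {dataOtp + "," + eventingOtp},
      		"ejectedNodes": {""},
      	})
      	waitForRebalance()
      }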

      Further testing has revealed that we can reproduce the same issue with multiple eventing nodes if they are all removed or failed over at the same time.

      Steps to reproduce:

      1. Create a cluster with at least one Data Service node, and at least 2 Eventing Service nodes
      2. Deploy any function on a bucket
      3. Remove or failover all eventing nodes at once, and rebalance
      4. Add the eventing nodes back to the cluster, rebalance

      Looking into why this happens, we can see in the logs of our test cluster that a number of vBuckets are stuck attempting a "vbTakeover request", with the following messages repeating several times for a number of vBuckets:

      2018-12-17T14:33:16.426+00:00 [Info] Consumer::doVbTakeover [worker_testingfunction_0:/tmp/127.0.0.1:8091_worker_testingfunction_0.sock:5040] vb: 1 dcp stream status: running curr owner: 10.112.181.102:8096 worker: worker_testingfunction_0 UUID consumer: 1ac6b14016f1cedcfc098ae65171fabd from metadata: 8ee09b7e635f6e5b512616063c74006a check if current node should own vb: true
      2018-12-17T14:33:16.426+00:00 [Info] Consumer::doVbTakeover [worker_testingfunction_0:/tmp/127.0.0.1:8091_worker_testingfunction_0.sock:5040] vb: 1 owned by node: 10.112.181.102:8096 worker: worker_testingfunction_0
      2018-12-17T14:33:16.426+00:00 [Info] Consumer::vbTakeoverCallback [worker_testingfunction_0:/tmp/127.0.0.1:8091_worker_testingfunction_0.sock:5040] vb: 1 vbTakeover request, msg: vbucket is owned by another node
      

      Looking at the source code where this case is handled, I believe that vBucket ownership is determined by the node's UUID:

      From https://github.com/couchbase/eventing/blob/5ca1a3284f74ee1e0e87b06d941604ae3eafd8af/consumer/vbucket_takeover.go:

      	case dcpStreamRunning:
       
      		logging.Infof("%s [%s:%s:%d] vb: %d dcp stream status: %s curr owner: %rs worker: %v UUID consumer: %s from metadata: %s check if current node should own vb: %t",
      			logPrefix, c.workerName, c.tcpPort, c.Pid(), vb, vbBlob.DCPStreamStatus,
      			vbBlob.CurrentVBOwner, vbBlob.AssignedWorker, c.NodeUUID(),
      			vbBlob.NodeUUID, c.checkIfCurrentNodeShouldOwnVb(vb))
       
      		if vbBlob.NodeUUID != c.NodeUUID() {
      			// Case 1a: Some node that isn't part of the cluster has spawned DCP stream for the vbucket.
      			//         Hence start the connection from consumer, discarding previous state.
      			if !c.producer.IsEventingNodeAlive(vbBlob.CurrentVBOwner, vbBlob.NodeUUID) && c.checkIfCurrentNodeShouldOwnVb(vb) {
      				logging.Infof("%s [%s:%s:%d] vb: %d node: %rs taking ownership. Old node: %rs isn't alive any more as per ns_server vbuuid: %s vblob.uuid: %s",
      					logPrefix, c.workerName, c.tcpPort, c.Pid(), vb, c.HostPortAddr(), vbBlob.CurrentVBOwner,
      					c.NodeUUID(), vbBlob.NodeUUID)
      				return c.updateVbOwnerAndStartDCPStream(vbKey, vb, &vbBlob)
      			}
       
      			// Case 1b: Invalid worker on another node is owning up vbucket stream
      			if !util.Contains(vbBlob.AssignedWorker, possibleConsumers) {
      				return c.updateVbOwnerAndStartDCPStream(vbKey, vb, &vbBlob)
      			}
      		}
       
      		if vbBlob.NodeUUID == c.NodeUUID() {
      			// Case 2a: Current consumer has already spawned DCP stream for the vbucket
      			if vbBlob.AssignedWorker == c.ConsumerName() {
      				logging.Infof("%s [%s:%s:%d] vb: %d current consumer and eventing node has already opened dcp stream. Stream status: %s, skipping",
      					logPrefix, c.workerName, c.tcpPort, c.Pid(), vb, vbBlob.DCPStreamStatus)
      				return nil
      			}
       
      			logging.Infof("%s [%s:%s:%d] vb: %d owned by another worker: %s on same node",
      				logPrefix, c.workerName, c.tcpPort, c.Pid(), vb, vbBlob.AssignedWorker)
       
      			if !util.Contains(vbBlob.AssignedWorker, possibleConsumers) {
      				// Case 2b: Worker who is invalid right now, has the ownership per metadata. Could happen for example:
      				//         t1 - Eventing starts off with worker count 10
      				//         t2 - Function was paused and resumed with worker count 3
      				//         t3 - Eventing rebalance was kicked off and KV rolled back metadata bucket to t1
      				//         This would currently cause rebalance to get stuck
      				//         In this case, it makes sense to revoke ownership metadata of old owners.
      				return c.updateVbOwnerAndStartDCPStream(vbKey, vb, &vbBlob)
      			}
       
      			// Case 2c: An existing & running consumer on current Eventing node  has owned up the vbucket
      			return errVbOwnedByAnotherWorker
      		}
       
      		// Case 3: Another running Eventing node has the ownership of the vbucket stream
      		logging.Infof("%s [%s:%s:%d] vb: %d owned by node: %s worker: %s",
      			logPrefix, c.workerName, c.tcpPort, c.Pid(), vb, vbBlob.CurrentVBOwner, vbBlob.AssignedWorker)
      		return errVbOwnedByAnotherNode
      

      When a node is removed and then added back to the cluster, it is assigned a new UUID, despite having the same hostname. I believe this leaves a stale UUID in the vBucket metadata, so the node concludes that ownership of the vBucket belongs to another eventing node, which is why we end up stuck in this state.
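
      To make the suspected mechanism concrete, the following is a simplified, self-contained sketch of the decision flow quoted above; the types, field names and the host:port-based liveness check are assumptions made for illustration rather than the actual eventing implementation. With the stale owner UUID still recorded in the metadata, but the same host:port reported as alive by ns_server, the Case 1a takeover is skipped and the consumer falls through to Case 3 on every retry, returning errVbOwnedByAnotherNode indefinitely.

      // stale_uuid_sketch.go: simplified model of the vbTakeover decision above.
      package main

      import (
      	"errors"
      	"fmt"
      )

      var errVbOwnedByAnotherNode = errors.New("vbucket is owned by another node")

      // vbMetadata mimics the relevant fields of vbBlob stored in the metadata bucket.
      type vbMetadata struct {
      	CurrentVBOwner string // host:port recorded as the vBucket owner
      	NodeUUID       string // UUID recorded for that owner
      }

      // consumer mimics the fields the takeover logic consults on the local node.
      type consumer struct {
      	nodeUUID  string   // new UUID assigned when the node rejoined the cluster
      	liveHosts []string // eventing host:ports ns_server currently reports as alive
      }

      // isEventingNodeAlive is assumed here to match on host:port only, which is
      // what makes the stale metadata look like a live foreign owner.
      func (c *consumer) isEventingNodeAlive(hostPort string) bool {
      	for _, h := range c.liveHosts {
      		if h == hostPort {
      			return true
      		}
      	}
      	return false
      }

      // doVbTakeover mirrors the dcpStreamRunning branch quoted above, reduced to
      // the parts that matter for the stale-UUID scenario.
      func (c *consumer) doVbTakeover(blob vbMetadata) error {
      	if blob.NodeUUID != c.nodeUUID {
      		// Case 1a: take over only if the recorded owner is no longer alive.
      		// The re-added node has the same host:port, so it looks alive even
      		// though the recorded UUID no longer exists in the cluster.
      		if !c.isEventingNodeAlive(blob.CurrentVBOwner) {
      			return nil // would update ownership and start the DCP stream
      		}
      	}
      	if blob.NodeUUID == c.nodeUUID {
      		return nil // Cases 2a-2c: metadata already points at this node
      	}
      	// Case 3: reached on every retry, so the rebalance never makes progress.
      	return errVbOwnedByAnotherNode
      }

      func main() {
      	c := &consumer{
      		nodeUUID:  "1ac6b14016f1cedcfc098ae65171fabd", // UUID after re-adding the node
      		liveHosts: []string{"10.112.181.102:8096"},
      	}
      	blob := vbMetadata{
      		CurrentVBOwner: "10.112.181.102:8096",              // same host:port as before...
      		NodeUUID:       "8ee09b7e635f6e5b512616063c74006a", // ...but the old, stale UUID
      	}
      	fmt.Println(c.doVbTakeover(blob)) // prints: vbucket is owned by another node
      }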

      I haven't tested the impact on Eventing itself extensively, but it also appears that Eventing functions no longer work on the bucket while in this state.

      The workaround is to redeploy the Eventing function, which refreshes the metadata, but this isn't ideal.
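
      For completeness, a sketch of scripting that workaround against the Eventing REST API: it assumes the /api/v1/functions/<name>/settings endpoint on the Eventing admin port 8096 accepts deployment_status/processing_status toggles; the host, function name and credentials are placeholders, and availability of this endpoint on older builds should be verified first.

      // redeploy_sketch.go: undeploy and redeploy an Eventing function to refresh
      // its vBucket ownership metadata (the workaround described above).
      package main

      import (
      	"fmt"
      	"net/http"
      	"strings"
      	"time"
      )

      const (
      	eventingURL = "http://10.112.181.102:8096" // Eventing admin port (placeholder host)
      	function    = "testingfunction"            // placeholder function name
      	user        = "Administrator"              // placeholder credentials
      	pass        = "password"
      )

      // setDeployment toggles deployment_status/processing_status for the function.
      func setDeployment(deploy bool) {
      	body := fmt.Sprintf(`{"deployment_status": %t, "processing_status": %t}`, deploy, deploy)
      	req, _ := http.NewRequest("POST",
      		eventingURL+"/api/v1/functions/"+function+"/settings",
      		strings.NewReader(body))
      	req.SetBasicAuth(user, pass)
      	req.Header.Set("Content-Type", "application/json")
      	resp, err := http.DefaultClient.Do(req)
      	if err != nil {
      		fmt.Println("deploy =", deploy, "->", err)
      		return
      	}
      	defer resp.Body.Close()
      	fmt.Println("deploy =", deploy, "->", resp.Status)
      }

      func main() {
      	// Undeploy, give the undeploy time to settle, then deploy again.
      	setDeployment(false)
      	time.Sleep(30 * time.Second) // crude wait; check the function status before redeploying
      	setDeployment(true)
      }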


            People

              Assignee: Abhishek Singh (asingh)
              Reporter: Toby Wilds (toby.wilds)
              Votes: 0
              Watchers: 8
