Couchbase Server / MB-32383

Eventing: Rebalance hangs if all eventing nodes are removed and added back to cluster


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: 6.0.1
    • Affects Version/s: 5.5.1
    • Component/s: eventing
    • Labels: None
    • Triage: Untriaged
    • Is this a Regression?: Unknown

    Description

      We have observed in the wild that if the sole eventing node is removed from the cluster, whether by rebalance or by failover, and the node is then added back to the cluster, the subsequent rebalance hangs.

      We are able to consistently reproduce the issue with the following steps:

      1. Create a cluster with at least one Data Service node, and one Eventing Service node
      2. Deploy any function on a bucket
      3. Remove or failover the eventing node, and rebalance
      4. Add the eventing node back to the cluster, rebalance

      After this, the rebalance hangs.
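
      As a convenience, the following is a minimal sketch of steps 3 and 4 driven through the cluster management REST API on port 8091, assuming the standard /controller/failOver, /controller/addNode and /controller/rebalance endpoints; the hostnames, otpNode names and Administrator/password credentials are placeholders. The same sequence, with every eventing node failed over before the first rebalance, also reproduces the multi-node variant described below.

      // repro.go: sketch of steps 3-4 -- hard failover the eventing node,
      // rebalance it out, add it back with the eventing service, rebalance again.
      package main

      import (
      	"fmt"
      	"io"
      	"net/http"
      	"net/url"
      	"strings"
      	"time"
      )

      const (
      	clusterURL   = "http://10.112.181.101:8091" // any data node (placeholder)
      	user         = "Administrator"              // placeholder credentials
      	pass         = "password"
      	dataOtp      = "ns_1@10.112.181.101" // otpNode names (placeholders)
      	eventingOtp  = "ns_1@10.112.181.102"
      	eventingHost = "10.112.181.102" // eventing node hostname (placeholder)
      )

      // post sends a form-encoded POST to an ns_server endpoint with basic auth.
      func post(path string, form url.Values) {
      	req, _ := http.NewRequest("POST", clusterURL+path, strings.NewReader(form.Encode()))
      	req.SetBasicAuth(user, pass)
      	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
      	resp, err := http.DefaultClient.Do(req)
      	if err != nil {
      		fmt.Println(path, "->", err)
      		return
      	}
      	defer resp.Body.Close()
      	fmt.Println(path, "->", resp.Status)
      }

      // waitForRebalance polls /pools/default/rebalanceProgress until the cluster
      // reports that no rebalance is running.
      func waitForRebalance() {
      	for {
      		req, _ := http.NewRequest("GET", clusterURL+"/pools/default/rebalanceProgress", nil)
      		req.SetBasicAuth(user, pass)
      		resp, err := http.DefaultClient.Do(req)
      		if err != nil {
      			return
      		}
      		body, _ := io.ReadAll(resp.Body)
      		resp.Body.Close()
      		if strings.Contains(string(body), `"none"`) {
      			return
      		}
      		time.Sleep(5 * time.Second)
      	}
      }

      func main() {
      	// Step 3: hard failover the sole eventing node, then rebalance it out.
      	post("/controller/failOver", url.Values{"otpNode": {eventingOtp}})
      	post("/controller/rebalance", url.Values{
      		"knownNodes":   {dataOtp + "," + eventingOtp},
      		"ejectedNodes": {eventingOtp},
      	})
      	waitForRebalance()

      	// Step 4: add the eventing node back with the eventing service and
      	// rebalance again -- this is the rebalance that hangs, so the final
      	// waitForRebalance() never returns on an affected cluster.
      	post("/controller/addNode", url.Values{
      		"hostname": {eventingHost},
      		"user":     {user},
      		"password": {pass},
      		"services": {"eventing"},
      	})
      	post("/controller/rebalance", url.Values{
      		"knownNodes":   {dataOtp + "," + eventingOtp},
      		"ejectedNodes": {""},
      	})
      	waitForRebalance()
      }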

      Further testing has revealed that we can reproduce the same issue with multiple eventing nodes if they are all removed or failed over at the same time.

      Steps to reproduce:

      1. Create a cluster with at least one Data Service node, and at least 2 Eventing Service nodes
      2. Deploy any function on a bucket
      3. Remove or failover all eventing nodes at once, and rebalance
      4. Add the eventing nodes back to the cluster, rebalance

      Looking into why this happens, we can see in the logs of our test cluster that a number of vBuckets are stuck attempting a "vbTakeover request", with the following messages repeating several times for a number of vBuckets:

      2018-12-17T14:33:16.426+00:00 [Info] Consumer::doVbTakeover [worker_testingfunction_0:/tmp/127.0.0.1:8091_worker_testingfunction_0.sock:5040] vb: 1 dcp stream status: running curr owner: 10.112.181.102:8096 worker: worker_testingfunction_0 UUID consumer: 1ac6b14016f1cedcfc098ae65171fabd from metadata: 8ee09b7e635f6e5b512616063c74006a check if current node should own vb: true
      2018-12-17T14:33:16.426+00:00 [Info] Consumer::doVbTakeover [worker_testingfunction_0:/tmp/127.0.0.1:8091_worker_testingfunction_0.sock:5040] vb: 1 owned by node: 10.112.181.102:8096 worker: worker_testingfunction_0
      2018-12-17T14:33:16.426+00:00 [Info] Consumer::vbTakeoverCallback [worker_testingfunction_0:/tmp/127.0.0.1:8091_worker_testingfunction_0.sock:5040] vb: 1 vbTakeover request, msg: vbucket is owned by another node
      

      Looking at the source code where this case is handled, I believe that vBucket ownership is determined by the node's UUID:

      From https://github.com/couchbase/eventing/blob/5ca1a3284f74ee1e0e87b06d941604ae3eafd8af/consumer/vbucket_takeover.go:

      	case dcpStreamRunning:
       
      		logging.Infof("%s [%s:%s:%d] vb: %d dcp stream status: %s curr owner: %rs worker: %v UUID consumer: %s from metadata: %s check if current node should own vb: %t",
      			logPrefix, c.workerName, c.tcpPort, c.Pid(), vb, vbBlob.DCPStreamStatus,
      			vbBlob.CurrentVBOwner, vbBlob.AssignedWorker, c.NodeUUID(),
      			vbBlob.NodeUUID, c.checkIfCurrentNodeShouldOwnVb(vb))
       
      		if vbBlob.NodeUUID != c.NodeUUID() {
      			// Case 1a: Some node that isn't part of the cluster has spawned DCP stream for the vbucket.
      			//         Hence start the connection from consumer, discarding previous state.
      			if !c.producer.IsEventingNodeAlive(vbBlob.CurrentVBOwner, vbBlob.NodeUUID) && c.checkIfCurrentNodeShouldOwnVb(vb) {
      				logging.Infof("%s [%s:%s:%d] vb: %d node: %rs taking ownership. Old node: %rs isn't alive any more as per ns_server vbuuid: %s vblob.uuid: %s",
      					logPrefix, c.workerName, c.tcpPort, c.Pid(), vb, c.HostPortAddr(), vbBlob.CurrentVBOwner,
      					c.NodeUUID(), vbBlob.NodeUUID)
      				return c.updateVbOwnerAndStartDCPStream(vbKey, vb, &vbBlob)
      			}
       
      			// Case 1b: Invalid worker on another node is owning up vbucket stream
      			if !util.Contains(vbBlob.AssignedWorker, possibleConsumers) {
      				return c.updateVbOwnerAndStartDCPStream(vbKey, vb, &vbBlob)
      			}
      		}
       
      		if vbBlob.NodeUUID == c.NodeUUID() {
      			// Case 2a: Current consumer has already spawned DCP stream for the vbucket
      			if vbBlob.AssignedWorker == c.ConsumerName() {
      				logging.Infof("%s [%s:%s:%d] vb: %d current consumer and eventing node has already opened dcp stream. Stream status: %s, skipping",
      					logPrefix, c.workerName, c.tcpPort, c.Pid(), vb, vbBlob.DCPStreamStatus)
      				return nil
      			}
       
      			logging.Infof("%s [%s:%s:%d] vb: %d owned by another worker: %s on same node",
      				logPrefix, c.workerName, c.tcpPort, c.Pid(), vb, vbBlob.AssignedWorker)
       
      			if !util.Contains(vbBlob.AssignedWorker, possibleConsumers) {
      				// Case 2b: Worker who is invalid right now, has the ownership per metadata. Could happen for example:
      				//         t1 - Eventing starts off with worker count 10
      				//         t2 - Function was paused and resumed with worker count 3
      				//         t3 - Eventing rebalance was kicked off and KV rolled back metadata bucket to t1
      				//         This would currently cause rebalance to get stuck
      				//         In this case, it makes sense to revoke ownership metadata of old owners.
      				return c.updateVbOwnerAndStartDCPStream(vbKey, vb, &vbBlob)
      			}
       
      			// Case 2c: An existing & running consumer on current Eventing node  has owned up the vbucket
      			return errVbOwnedByAnotherWorker
      		}
       
      		// Case 3: Another running Eventing node has the ownership of the vbucket stream
      		logging.Infof("%s [%s:%s:%d] vb: %d owned by node: %s worker: %s",
      			logPrefix, c.workerName, c.tcpPort, c.Pid(), vb, vbBlob.CurrentVBOwner, vbBlob.AssignedWorker)
      		return errVbOwnedByAnotherNode
      

      When a node is removed and then added back to the cluster, it is assigned a new UUID, despite having the same hostname. I believe this leaves a stale UUID in the vBucket metadata, so the node concludes that ownership of the vBucket belongs to another eventing node, which is why we end up stuck in this state.
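
      To make the suspected mechanism concrete, the following is a simplified, self-contained sketch of the decision flow quoted above; the types, field names and the host:port-based liveness check are assumptions made for illustration rather than the actual eventing implementation. With the stale owner UUID still recorded in the metadata, but the same host:port reported as alive by ns_server, the Case 1a takeover is skipped and the consumer falls through to Case 3 on every retry, returning errVbOwnedByAnotherNode indefinitely.

      // stale_uuid_sketch.go: simplified model of the vbTakeover decision above.
      package main

      import (
      	"errors"
      	"fmt"
      )

      var errVbOwnedByAnotherNode = errors.New("vbucket is owned by another node")

      // vbMetadata mimics the relevant fields of vbBlob stored in the metadata bucket.
      type vbMetadata struct {
      	CurrentVBOwner string // host:port recorded as the vBucket owner
      	NodeUUID       string // UUID recorded for that owner
      }

      // consumer mimics the fields the takeover logic consults on the local node.
      type consumer struct {
      	nodeUUID  string   // new UUID assigned when the node rejoined the cluster
      	liveHosts []string // eventing host:ports ns_server currently reports as alive
      }

      // isEventingNodeAlive is assumed here to match on host:port only, which is
      // what makes the stale metadata look like a live foreign owner.
      func (c *consumer) isEventingNodeAlive(hostPort string) bool {
      	for _, h := range c.liveHosts {
      		if h == hostPort {
      			return true
      		}
      	}
      	return false
      }

      // doVbTakeover mirrors the dcpStreamRunning branch quoted above, reduced to
      // the parts that matter for the stale-UUID scenario.
      func (c *consumer) doVbTakeover(blob vbMetadata) error {
      	if blob.NodeUUID != c.nodeUUID {
      		// Case 1a: take over only if the recorded owner is no longer alive.
      		// The re-added node has the same host:port, so it looks alive even
      		// though the recorded UUID no longer exists in the cluster.
      		if !c.isEventingNodeAlive(blob.CurrentVBOwner) {
      			return nil // would update ownership and start the DCP stream
      		}
      	}
      	if blob.NodeUUID == c.nodeUUID {
      		return nil // Cases 2a-2c: metadata already points at this node
      	}
      	// Case 3: reached on every retry, so the rebalance never makes progress.
      	return errVbOwnedByAnotherNode
      }

      func main() {
      	c := &consumer{
      		nodeUUID:  "1ac6b14016f1cedcfc098ae65171fabd", // UUID after re-adding the node
      		liveHosts: []string{"10.112.181.102:8096"},
      	}
      	blob := vbMetadata{
      		CurrentVBOwner: "10.112.181.102:8096",              // same host:port as before...
      		NodeUUID:       "8ee09b7e635f6e5b512616063c74006a", // ...but the old, stale UUID
      	}
      	fmt.Println(c.doVbTakeover(blob)) // prints: vbucket is owned by another node
      }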

      I haven't tested the impact on Eventing itself extensively, but it also appears that Eventing functions no longer work on the bucket while in this state.

      The workaround is to redeploy the Eventing function, which refreshes the metadata, but this isn't ideal.
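
      For completeness, a sketch of scripting that workaround against the Eventing REST API: it assumes the /api/v1/functions/<name>/settings endpoint on the Eventing admin port 8096 accepts deployment_status/processing_status toggles; the host, function name and credentials are placeholders, and availability of this endpoint on older builds should be verified first.

      // redeploy_sketch.go: undeploy and redeploy an Eventing function to refresh
      // its vBucket ownership metadata (the workaround described above).
      package main

      import (
      	"fmt"
      	"net/http"
      	"strings"
      	"time"
      )

      const (
      	eventingURL = "http://10.112.181.102:8096" // Eventing admin port (placeholder host)
      	function    = "testingfunction"            // placeholder function name
      	user        = "Administrator"              // placeholder credentials
      	pass        = "password"
      )

      // setDeployment toggles deployment_status/processing_status for the function.
      func setDeployment(deploy bool) {
      	body := fmt.Sprintf(`{"deployment_status": %t, "processing_status": %t}`, deploy, deploy)
      	req, _ := http.NewRequest("POST",
      		eventingURL+"/api/v1/functions/"+function+"/settings",
      		strings.NewReader(body))
      	req.SetBasicAuth(user, pass)
      	req.Header.Set("Content-Type", "application/json")
      	resp, err := http.DefaultClient.Do(req)
      	if err != nil {
      		fmt.Println("deploy =", deploy, "->", err)
      		return
      	}
      	defer resp.Body.Close()
      	fmt.Println("deploy =", deploy, "->", resp.Status)
      }

      func main() {
      	// Undeploy, give the undeploy time to settle, then deploy again.
      	setDeployment(false)
      	time.Sleep(30 * time.Second) // crude wait; check the function status before redeploying
      	setDeployment(true)
      }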


            People

              Assignee: Abhishek Singh (asingh)
              Reporter: Toby Wilds (toby.wilds)
              Votes: 0
              Watchers: 8
