OperationCanceledException continues after rebalance completes

Description

Initial discussion on the issue: https://forums.couchbase.com/t/couchbase-v3-sdk-kvnotmyvbucket-errors-after-add-node-rebalance/35438

Initial ticket for the issue: https://couchbasecloud.atlassian.net/browse/NCBC-3350

 

Preconditions:

  1. Have two nodes: one already in the cluster and receiving traffic (call it #1), and one waiting to be added to the cluster via rebalance (call it #2)

  2. Have a .NET client sending requests to the cluster

Steps

  1. Block connections from the .NET client to node #2

  2. Start the rebalance operation

  3. In the middle of the rebalance, unblock connections to node #2

  4. Wait for the rebalance to finish

Expected Result

The app recovers after the rebalance finishes and continues working properly.

Actual Result

The app does not recover and keeps throwing exceptions:

  • Timeout exceptions before v3.4.5

  • TaskCanceledException after v3.4.5 (this is not fixed in v3.4.6, which can be verified with this test)

Dev Notes

Lots of investigation details can be found on the forum topic and in the previous ticket (the links are above).

The latest details:

There is a difference between this patch set (which solves the issue) https://review.couchbase.org/c/couchbase-net-client/+/186991 and the released v3.4.5 (and v3.4.6) versions.

The difference is in the ClusterContext.ProcessClusterMap method. In the patch, and I believe in all previous versions, it propagated to the bucket class any exception that occurred while connecting, but in v3.4.5 there is a try/catch block that handles all exceptions and just logs them. This makes the CouchbaseBucket.ConfigUpdatedAsync method think that the connection was established and update its CurrentConfig property (which prevents it from being updated in the future), while the Nodes collection still does not contain the new node.

Current code that hides exceptions: https://github.com/couchbase/couchbase-net-client/blob/master/src/Couchbase/Core/ClusterContext.cs#L813-L816
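
The failure mode described above can be sketched with a small model. This is not the SDK's C# code: it is a language-neutral illustration written in Python, and every name in it (ClusterMap, Bucket, process_cluster_map, config_updated, simulate) is hypothetical. It also assumes that a map is skipped when its revision is not newer than the one already recorded, which is how the "prevents it from being updated in the future" behaviour is read here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterMap:
    rev: int       # config revision (assumed to increase monotonically)
    nodes: tuple   # host names listed in this revision


class Bucket:
    """Hypothetical stand-in for the bucket plus cluster-context logic."""

    def __init__(self, connect, swallow_errors):
        self.connect = connect              # callable that may raise ConnectionError
        self.swallow_errors = swallow_errors
        self.current_rev = 0
        self.nodes = set()

    def process_cluster_map(self, cluster_map):
        for node in cluster_map.nodes:
            if node in self.nodes:
                continue
            try:
                self.connect(node)
                self.nodes.add(node)
            except ConnectionError:
                if not self.swallow_errors:
                    raise                   # patch-set behaviour: caller sees the failure
                # v3.4.5-like behaviour: the error is only "logged" and then lost

    def config_updated(self, cluster_map):
        if cluster_map.rev <= self.current_rev:
            return False                    # not newer than what was recorded: ignored
        try:
            self.process_cluster_map(cluster_map)
        except ConnectionError:
            return False                    # revision not recorded, so it can be retried
        self.current_rev = cluster_map.rev
        return True


def simulate(swallow_errors):
    blocked = {"node2"}                     # node #2 is unreachable at first

    def connect(node):
        if node in blocked:
            raise ConnectionError(f"cannot reach {node}")

    bucket = Bucket(connect, swallow_errors)
    bucket.config_updated(ClusterMap(1, ("node1",)))
    rebalance_map = ClusterMap(2, ("node1", "node2"))
    bucket.config_updated(rebalance_map)    # arrives while node #2 is blocked
    blocked.clear()                         # connections are unblocked mid-rebalance
    bucket.config_updated(rebalance_map)    # the same revision is delivered again
    return bucket.current_rev, sorted(bucket.nodes)


if __name__ == "__main__":
    print(simulate(swallow_errors=True))    # (2, ['node1'])
    print(simulate(swallow_errors=False))   # (2, ['node1', 'node2'])
```

With the error swallowed, the model records revision 2 without ever adding node2, so the redelivered map is ignored and the node stays missing. With the error propagated, the revision is not recorded on failure and the retry adds node2.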

Environment

None

Gerrit Reviews

None

Release Notes Description

None

Attachments

1
  • 01 Jun 2023, 11:39 PM

Activity

Jeffry Morris June 2, 2023 at 4:49 PM

Good to hear. I'll look into the TaskCanceledException; yes, they should be XxxTimeoutExceptions.

Jeff

Eugene Shcherbo June 2, 2023 at 4:04 PM

Hi,

I tested it, and it looks like the rebalance issue is fixed in the package. Thank you.

Just FYI: before the cluster map was updated, I still saw TaskCanceledException instead of timeouts. This is not an issue for me since I know that it usually means a timeout, but I wanted to let you know.

Jeffry Morris June 1, 2023 at 11:39 PM

VF:

Jeffry Morris June 1, 2023 at 5:39 PM

Hello,

Indeed, in this specific case a config can appear to be processed when it has actually failed to be processed correctly, leaving the SDK in a bad state until it can successfully process a newer config revision. It's definitely a client bug, and a patch is in the works for 3.4.7, which is planned for release on 6/6/2023. Triggering the bug is somewhat of an edge case: we would expect the ports/hosts to be discoverable by the SDK while the rebalance is occurring.

We will post a package for testing sometime today as a VF. It hasn't been through QE, so I wouldn't use it in production until the official v3.4.7 release is on NuGet.

Thanks,
Jeff

Eugene Shcherbo May 26, 2023 at 5:18 PM

Thank you  

I am sorry, I think I wrote the wrong issue number in the ticket description. The initial ticket for this issue was https://couchbasecloud.atlassian.net/browse/NCBC-3334

Fixed
Created May 26, 2023 at 3:38 PM
Updated June 6, 2023 at 5:35 PM
Resolved June 6, 2023 at 5:35 PM