[BP 7.1.5] - XDCR - metakv "revision does not match" shows up

Description

Executive Summary:

  1. goxdcr's handling of metakv callbacks has been racy since inception.

  2. goxdcr has a hidden race condition between the replication settings upgrade path and user-initiated replication settings changes issued through the REST API. The upgrade path was introduced early in XDCR (5.0, IIRC).

  3. As part of the binary filter improvements (), an unnecessary replication setting key was added to the internal data structure, causing the upgrade path (#2 above) to be triggered every time a replication is changed (pause / resume or any other setting change). This exposes the race condition in #2 and, as a result, also exposes #1.

  4. Fixing all of 1, 2, and 3 is necessary to fully correct the situation.
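
How two writers can collide on a revision-checked store, as described in #2, can be sketched roughly as follows. This is a toy in-memory model, not the metakv API and not goxdcr code: the `store` type, its `get` / `set` methods, and `ErrRevMismatch` are all hypothetical names made up for illustration. It assumes the store rejects any write whose revision differs from the one currently stored.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRevMismatch is a hypothetical stand-in for the "revision does not match" error.
var ErrRevMismatch = errors.New("revision does not match")

// entry is a stored value plus the revision assigned at its last write.
type entry struct {
	value string
	rev   int
}

// store is a toy revision-checked key-value store; it is not the metakv API.
type store struct {
	data map[string]entry
}

func newStore() *store { return &store{data: make(map[string]entry)} }

// get returns the current value and revision for key.
func (s *store) get(key string) (string, int) {
	e := s.data[key]
	return e.value, e.rev
}

// set writes value only if rev matches the stored revision, then bumps it.
func (s *store) set(key, value string, rev int) error {
	cur := s.data[key]
	if cur.rev != rev {
		return ErrRevMismatch
	}
	s.data[key] = entry{value: value, rev: cur.rev + 1}
	return nil
}

func main() {
	s := newStore()
	_ = s.set("repl", "v0", 0) // initial write, revision becomes 1

	// Both code paths read the same revision before either one writes.
	_, upgradeRev := s.get("repl")
	_, restRev := s.get("repl")

	// The upgrade path writes first and bumps the revision to 2.
	fmt.Println("upgrade:", s.set("repl", "upgraded", upgradeRev))
	// The REST-driven change still holds revision 1, so it is rejected.
	fmt.Println("rest:", s.set("repl", "paused", restRev))
}
```

In this model the collision only needs the upgrade path to run alongside a user change, which is why triggering the upgrade path on every settings change (#3) would make it far more likely to show up.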

Original description
Update 7/6: We haven't seen this one before in goxdcr tests. It seems that ns_server's make simple test is also showing this failure. The component is unknown. Conversation: https://couchbase.slack.com/archives/CC6NF8ERY/p1688692635407479

This was found during a regular run of the developer's collections test suite. We have a 1-node source cluster and a 1-node target cluster.

Due to an unknown race, metakv's object revision ID can get out of sync, leading to the "revision number does not match" issue. I find this hard to believe, because this is a single-node cluster and there shouldn't be any other writers or readers of the data in metakv.

From this point on, the in-memory replication is unable to overwrite the metakv entries, and any attempt to change or delete the replication results in an error.

Deleting the source bucket should have caused gc to take place, but even that fails:

We may need to look into whether the xdcr process needs to properly handle these types of revision issues. This is a one-node source cluster, and restarting or force-killing goxdcr may not be an acceptable option in production.
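
One possible shape for such handling is to refresh the stale revision and retry, sketched below. This is only an illustration under stated assumptions, not what the eventual fix does: `revStore`, `cachedSpec`, `setWithRefresh`, and `errRevMismatch` are hypothetical names, and the toy store stands in for metakv rather than using its API.

```go
package main

import (
	"errors"
	"fmt"
)

// errRevMismatch is a hypothetical stand-in for the "revision does not match" error.
var errRevMismatch = errors.New("revision does not match")

// revStore is a toy revision-checked store for a single key; it is not the metakv API.
type revStore struct {
	value string
	rev   int
}

// set writes value only when the caller's revision matches the stored one.
func (s *revStore) set(value string, rev int) error {
	if rev != s.rev {
		return errRevMismatch
	}
	s.value, s.rev = value, s.rev+1
	return nil
}

// cachedSpec is the in-memory copy of a replication spec and the revision it was read at.
type cachedSpec struct {
	value string
	rev   int
}

// setWithRefresh tries the write; on a revision mismatch it adopts the store's
// current revision and retries, up to maxAttempts times.
func setWithRefresh(s *revStore, c *cachedSpec, value string, maxAttempts int) error {
	for i := 0; i < maxAttempts; i++ {
		err := s.set(value, c.rev)
		if err == nil {
			c.value, c.rev = value, s.rev
			return nil
		}
		if !errors.Is(err, errRevMismatch) {
			return err
		}
		c.rev = s.rev // refresh the stale revision before retrying
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, errRevMismatch)
}

func main() {
	s := &revStore{value: "spec", rev: 5}
	stale := &cachedSpec{value: "spec", rev: 4} // in-memory copy fell behind

	fmt.Println("plain set:", s.set("deleted", stale.rev))
	fmt.Println("with refresh:", setWithRefresh(s, stale, "deleted", 3))
}
```

Blindly overwriting after a refresh discards whatever the other writer stored, so this would only suit operations such as deletion where the latest content does not matter. A settings change would need to re-read and re-apply on top of the current value instead.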

 

Resolution

A legacy race condition, in which writes to the metadata store could conflict, was exposed by the binary filter improvements.

All of the legacy race conditions have been resolved.

Activity

Ayush Nayyar July 14, 2023 at 5:46 AM

Reproduced on 7.1.5-3830, verified on 7.1.5-3876.

Ashok Kumar Alluri July 13, 2023 at 2:37 PM

Regression runs on the new build with the fix are still in progress, and so far we have not hit the issue. For the jobs that have shown up in Greenboard so far, we have a 98.3% pass rate. Once Greenboard is updated with the latest runs, we will have a better idea.

CB robot July 11, 2023 at 5:12 PM

Build couchbase-server-7.1.5-3876 contains goxdcr commit c132394 with commit message:
: race conditions in metakv callback and settings upgrade

CB robot July 11, 2023 at 5:12 PM

Build couchbase-server-7.1.5-3876 contains goxdcr commit 72ee590 with commit message:
: remove un-necessary filter binary document setting

Fixed
Details

Is this a Regression?

Yes

Triage

Untriaged

Created July 11, 2023 at 12:17 AM
Updated September 19, 2023 at 10:37 AM
Resolved July 11, 2023 at 3:35 PM