Fixed
Details
Assignee: Ayush Nayyar
Reporter: Neil Huang
Is this a Regression?: Yes
Triage: Untriaged
Story Points: 0
Priority: Test Blocker
Created July 11, 2023 at 12:17 AM
Updated September 19, 2023 at 10:37 AM
Resolved July 11, 2023 at 3:35 PM
Executive Summary:
1. goxdcr's handling of metakv callbacks has been racy since inception.
2. goxdcr has a hidden race condition between the replication settings upgrade path and an actual user-induced replication settings change issued from the REST API. The upgrade path was introduced early in XDCR's history (5.0, IIRC).
3. As part of the binary filter improvements (), an unnecessary replication setting key was added to the internal data structure, causing the upgrade path (#2 above) to be triggered every single time a replication is changed (such as a pause / resume or any other setting change). This exposes the race condition in #2 and, as a result, also exposes #1.
Fixing all of 1, 2, and 3 is necessary to fully correct the situation.
Original description
Update 7/6: We haven't seen this one before in goxdcr tests. It seems that ns_server's make simple test is also showing this failure. Component is unknown. Conversation: https://couchbase.slack.com/archives/CC6NF8ERY/p1688692635407479
This was found during a regular run of the developer's collections test suite. We have a 1-node source cluster and a 1-node target cluster.
Due to an unknown race, the metakv's object revision ID could be out of sync, leading to the "revision number does not match" issue. I find it hard to believe because this is a single node cluster and there shouldn't be any other writers or readers of the data in metakv.
From this point on, the in-memory copy of the replication is unable to overwrite the metakv entries, and any attempt to change or delete the replication results in an error.
Deleting the source bucket should have caused gc to take place, but even that fails:
We may need to look into whether the XDCR process should handle these kinds of revision issues properly. This is a one-node source cluster, and restarting or force-killing goxdcr may not be an acceptable option in production.
Issue: A legacy race condition where the metadata store could cause a conflict was exposed as part of the binary filter improvements.
Resolution: Legacy race conditions have all been resolved.