[BP 7.2.1] - XDCR - metakv "revision does not match" shows up

Description

Executive Summary:

  1. goxdcr's handling of metakv callbacks has been racy since inception.

  2. goxdcr has a hidden race condition between the replication settings upgrade path and user-initiated replication settings changes issued via the REST API. The upgrade path was introduced early in XDCR's history (i.e. 5.0, IIRC).

  3. As part of the binary filter improvements (), an unnecessary replication setting key was added to the internal data structure, causing the upgrade path (#2 above) to be triggered every single time a replication is changed (such as a pause/resume or any setting change). This exposes the race condition in #2 and, as a result, #1 as well.

  4. Fixing all of 1, 2, and 3 is necessary to fully correct the situation.

Original description
Update 7/6: We haven't seen this one before in goxdcr tests. It seems that ns_server's "make simple" test is also showing this failure. The component is unknown. Conversation: https://couchbase.slack.com/archives/CC6NF8ERY/p1688692635407479

This was found during a regular run of the developer's collections test suite. We have a 1-node source cluster and a 1-node target cluster.

Due to an unknown race, the metakv object's revision ID could be out of sync, leading to the "revision number does not match" issue. I find this hard to believe because this is a single-node cluster, and there should not be any other writers or readers of the data in metakv.

2023-07-05T22:53:34.344-07:00 INFO GOXDCR.AdminPort: doChangeReplicationSettingsRequest
2023-07-05T22:53:34.344-07:00 INFO GOXDCR.AdminPort: Request params: replicationId=bc959726b1e9e8426810fceb3d6e9a2f/B1/B2 justValidate=false includeWarnings=false
2023-07-05T22:53:34.344-07:00 INFO GOXDCR.AdminPort: Request params: justValidate=false includeWarnings=false inputSettings=map[colMappingRules:map[S1:S1]]
2023-07-05T22:53:34.344-07:00 INFO GOXDCR.ReplMgr: Update replication settings for bc959726b1e9e8426810fceb3d6e9a2f/B1/B2, settings=map[colMappingRules:map[S1:S1]], justValidate=false
2023-07-05T22:53:34.344-07:00 INFO GOXDCR.ReplSpecSvc: Successfully retrieved target cluster reference 0x1005ec4e0. time taken=2.56µs
2023-07-05T22:53:34.346-07:00 ERRO GOXDCR.HttpServer: Internal error in adminport, revision number does not match
2023-07-05T22:53:35.424-07:00 INFO GOXDCR.AdminPort: doChangeReplicationSettingsRequest
2023-07-05T22:53:35.424-07:00 INFO GOXDCR.AdminPort: Request params: replicationId=bc959726b1e9e8426810fceb3d6e9a2f/B1/B2 justValidate=false includeWarnings=false
2023-07-05T22:53:35.424-07:00 INFO GOXDCR.AdminPort: Request params: justValidate=false includeWarnings=false inputSettings=map[active:true]
2023-07-05T22:53:35.424-07:00 INFO GOXDCR.ReplMgr: Update replication settings for bc959726b1e9e8426810fceb3d6e9a2f/B1/B2, settings=map[active:true], justValidate=false
2023-07-05T22:53:35.424-07:00 INFO GOXDCR.ReplSpecSvc: Successfully retrieved target cluster reference 0x1005ec4e0. time taken=2.979µs
2023-07-05T22:53:35.425-07:00 ERRO GOXDCR.HttpServer: Internal error in adminport, revision number does not match

From this point on, the in-memory replication spec is unable to overwrite the metakv entries, and any attempt to change or delete the replication results in an error.

Deleting the source bucket should have caused garbage collection to take place, but even that fails:

2023-07-05T22:56:44.358-07:00 WARN GOXDCR.ReplSpecSvc: Error validating replication specification bc959726b1e9e8426810fceb3d6e9a2f/B1/B2. error=Bucket B1 UUID has changed from ed2eb7693b679fb7fb66b69940b1ca5c to 7de06604fe0e9ab31fd501e0d7bb1ab2, indicating a bucket deletion and recreation
2023-07-05T22:56:44.358-07:00 ERRO GOXDCR.ReplSpecSvc: Replication specification bc959726b1e9e8426810fceb3d6e9a2f/B1/B2 is no longer valid, garbage collect it. error=Bucket B1 UUID has changed from ed2eb7693b679fb7fb66b69940b1ca5c to 7de06604fe0e9ab31fd501e0d7bb1ab2, indicating a bucket deletion and recreation
2023-07-05T22:56:44.360-07:00 ERRO GOXDCR.ReplSpecSvc: Failed to delete replication spec, key=replicationSpec/bc959726b1e9e8426810fceb3d6e9a2f/B1/B2, err=revision number does not match
2023-07-05T22:56:44.360-07:00 INFO GOXDCR.ReplSpecSvc: Failed to garbage collect spec bc959726b1e9e8426810fceb3d6e9a2f/B1/B2, err=revision number does not match
2023-07-05T22:56:44.360-07:00 INFO GOXDCR.PipelineMgr: Replication Status = map[bc959726b1e9e8426810fceb3d6e9a2f/B1/B2:name={bc959726b1e9e8426810fceb3d6e9a2f/B1/B2}, status={Paused}, errors={[]}, oldProgress={Source nozzles have been closed}, progress={Pipeline has been stopped}, oldBackfillProgress={Source nozzles have been closed}, backfillProgress={Pipeline has been stopped}]

We may need to look into whether the xdcr process should handle these types of revision issues properly. This is a one-node source cluster, and force-killing and restarting goxdcr may not be a kosher option in production.


Issue

Resolution

A legacy race condition, in which the metadata store could hit a revision conflict, was exposed as part of the binary filter improvements.

Legacy race conditions have all been resolved.

Components

Affects versions

Fix versions

Labels

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Attachments

2

Activity


Ayush Nayyar July 18, 2023 at 11:22 AM

Verified on 7.2.1-5861. No instances of error found, regressions and manual test clean.

CB robot July 11, 2023 at 7:32 AM

Build couchbase-server-7.2.1-5850 contains goxdcr commit 4d13734 with commit message:
: race conditions in metakv callback and settings upgrade

Fixed

Details

Assignee

Reporter

Is this a Regression?

Yes

Triage

Untriaged

Story Points

Priority


Created July 11, 2023 at 12:16 AM
Updated September 18, 2023 at 3:28 PM
Resolved July 11, 2023 at 4:35 AM