Fixed
Details
Assignee: Ayush Nayyar
Reporter: Neil Huang
Is this a Regression?: Yes
Triage: Untriaged
Story Points: 0
Priority: Test Blocker
Created July 11, 2023 at 12:16 AM
Updated September 18, 2023 at 3:28 PM
Resolved July 11, 2023 at 4:35 AM
Executive Summary:
1. goxdcr's handling of metakv callbacks has been racy since inception.
2. goxdcr has a hidden race condition between the replication settings upgrade path and an actual user-induced replication settings change issued from the REST API. The upgrade path was introduced in early XDCR (i.e. 5.0), IIRC.
3. As part of the binary filter improvements (), an unnecessary replication setting key was added to the internal data structure, causing the upgrade path (#2 above) to be triggered every single time a replication is changed (such as pause/resume or any setting change). This exposes the race condition in #2 and, as a result, also exposes #1.
Fixing all of #1, #2, and #3 is necessary to fully correct the situation.
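To make the pattern concrete, below is a minimal, hypothetical sketch in plain Go (not goxdcr code; the keys and names are made up). Two writers, one standing in for the upgrade path and one for a REST-issued settings change, each perform a read-modify-write against a revision-checked, metakv-like store. If both read the same revision, the second commit fails with "revision number does not match".

// racesketch.go: a hypothetical sketch, not goxdcr code. Two writers do a
// read-modify-write against a revision-checked (metakv-like) store; if both
// read the same revision, the second commit is rejected.
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errRevMismatch = errors.New("revision number does not match")

// store mimics compare-and-swap semantics: a write must present the revision
// it read, and it fails if the stored revision has moved on.
type store struct {
	mu    sync.Mutex
	value map[string]string
	rev   int
}

func (s *store) get() (map[string]string, int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	copied := make(map[string]string, len(s.value))
	for k, v := range s.value {
		copied[k] = v
	}
	return copied, s.rev
}

func (s *store) setWithRev(v map[string]string, rev int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if rev != s.rev {
		return errRevMismatch
	}
	s.value = v
	s.rev++
	return nil
}

func main() {
	spec := &store{value: map[string]string{"active": "true"}}

	var wg sync.WaitGroup
	start := make(chan struct{})

	// Each writer reads the spec, adds one key, and writes it back with the
	// revision it read. "filterVersion" stands in for the upgrade path's extra
	// key; "active" stands in for a pause request from the REST API.
	update := func(key, val string) {
		defer wg.Done()
		<-start
		doc, rev := spec.get() // read
		doc[key] = val         // modify
		if err := spec.setWithRev(doc, rev); err != nil { // compare-and-swap write
			fmt.Printf("writer %q lost the race: %v\n", key, err)
			return
		}
		fmt.Printf("writer %q committed, new revision %d\n", key, rev+1)
	}

	wg.Add(2)
	go update("filterVersion", "advanced")
	go update("active", "false")
	close(start)
	wg.Wait()
}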
Original description
Update 7/6: We haven't seen this one before in goxdcr tests. It seems that ns_server's "make simple test" is also showing this failure. The component is unknown. Conversation: https://couchbase.slack.com/archives/CC6NF8ERY/p1688692635407479
This was found during a regular run of the developer's collections test suite. We have a 1-node source cluster and a 1-node target cluster.
Due to an unknown race, metakv's object revision ID could be out of sync, leading to the "revision number does not match" issue. I find this hard to believe because this is a single-node cluster and there shouldn't be any other writers or readers of the data in metakv.
2023-07-05T22:53:34.344-07:00 INFO GOXDCR.AdminPort: doChangeReplicationSettingsRequest
2023-07-05T22:53:34.344-07:00 INFO GOXDCR.AdminPort: Request params: replicationId=bc959726b1e9e8426810fceb3d6e9a2f/B1/B2 justValidate=false includeWarnings=false
2023-07-05T22:53:34.344-07:00 INFO GOXDCR.AdminPort: Request params: justValidate=false includeWarnings=false inputSettings=map[colMappingRules:map[S1:S1]]
2023-07-05T22:53:34.344-07:00 INFO GOXDCR.ReplMgr: Update replication settings for bc959726b1e9e8426810fceb3d6e9a2f/B1/B2, settings=map[colMappingRules:map[S1:S1]], justValidate=false
2023-07-05T22:53:34.344-07:00 INFO GOXDCR.ReplSpecSvc: Successfully retrieved target cluster reference 0x1005ec4e0. time taken=2.56µs
2023-07-05T22:53:34.346-07:00 ERRO GOXDCR.HttpServer: Internal error in adminport, revision number does not match
2023-07-05T22:53:35.424-07:00 INFO GOXDCR.AdminPort: doChangeReplicationSettingsRequest
2023-07-05T22:53:35.424-07:00 INFO GOXDCR.AdminPort: Request params: replicationId=bc959726b1e9e8426810fceb3d6e9a2f/B1/B2 justValidate=false includeWarnings=false
2023-07-05T22:53:35.424-07:00 INFO GOXDCR.AdminPort: Request params: justValidate=false includeWarnings=false inputSettings=map[active:true]
2023-07-05T22:53:35.424-07:00 INFO GOXDCR.ReplMgr: Update replication settings for bc959726b1e9e8426810fceb3d6e9a2f/B1/B2, settings=map[active:true], justValidate=false
2023-07-05T22:53:35.424-07:00 INFO GOXDCR.ReplSpecSvc: Successfully retrieved target cluster reference 0x1005ec4e0. time taken=2.979µs
2023-07-05T22:53:35.425-07:00 ERRO GOXDCR.HttpServer: Internal error in adminport, revision number does not match
From this point on, the in-memory replication is unable to dominate the metakv entries, and any attempt to change or delete the replication results in an error.
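As a hedged illustration of that stuck state (hypothetical names, not goxdcr code), the sketch below shows how a cached, stale revision makes every subsequent revision-checked write or delete fail until the in-memory copy is refreshed from the store.

// stucksketch.go: a hypothetical illustration, not goxdcr code. Once the
// process caches a stale revision, every later revision-checked write or
// delete is rejected until the cached copy is refreshed.
package main

import (
	"errors"
	"fmt"
)

var errRevMismatch = errors.New("revision number does not match")

// kv mimics a revision-checked store: writes and deletes must present the
// current revision or they are rejected.
type kv struct {
	value string
	rev   int
}

func (s *kv) set(v string, rev int) error {
	if rev != s.rev {
		return errRevMismatch
	}
	s.value, s.rev = v, rev+1
	return nil
}

func (s *kv) del(rev int) error {
	if rev != s.rev {
		return errRevMismatch
	}
	s.value, s.rev = "", rev+1
	return nil
}

func main() {
	store := &kv{value: "spec-v1", rev: 7}
	cachedRev := 7 // the in-memory copy's revision

	// Another writer (e.g. the upgrade path) bumps the revision to 8.
	_ = store.set("spec-v2", 7)

	// Every attempt that still uses the cached revision now fails.
	fmt.Println("change:", store.set("spec-v3", cachedRev)) // revision number does not match
	fmt.Println("delete:", store.del(cachedRev))            // revision number does not match
}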
Deleting the source bucket should have caused gc to take place, but even that fails:
2023-07-05T22:56:44.358-07:00 WARN GOXDCR.ReplSpecSvc: Error validating replication specification bc959726b1e9e8426810fceb3d6e9a2f/B1/B2. error=Bucket B1 UUID has changed from ed2eb7693b679fb7fb66b69940b1ca5c to 7de06604fe0e9ab31fd501e0d7bb1ab2, indicating a bucket deletion and recreation
2023-07-05T22:56:44.358-07:00 ERRO GOXDCR.ReplSpecSvc: Replication specification bc959726b1e9e8426810fceb3d6e9a2f/B1/B2 is no longer valid, garbage collect it. error=Bucket B1 UUID has changed from ed2eb7693b679fb7fb66b69940b1ca5c to 7de06604fe0e9ab31fd501e0d7bb1ab2, indicating a bucket deletion and recreation
2023-07-05T22:56:44.360-07:00 ERRO GOXDCR.ReplSpecSvc: Failed to delete replication spec, key=replicationSpec/bc959726b1e9e8426810fceb3d6e9a2f/B1/B2, err=revision number does not match
2023-07-05T22:56:44.360-07:00 INFO GOXDCR.ReplSpecSvc: Failed to garbage collect spec bc959726b1e9e8426810fceb3d6e9a2f/B1/B2, err=revision number does not match
2023-07-05T22:56:44.360-07:00 INFO GOXDCR.PipelineMgr: Replication Status = map[bc959726b1e9e8426810fceb3d6e9a2f/B1/B2:name={bc959726b1e9e8426810fceb3d6e9a2f/B1/B2}, status={Paused}, errors={[]}, oldProgress={Source nozzles have been closed}, progress={Pipeline has been stopped}, oldBackfillProgress={Source nozzles have been closed}, backfillProgress={Pipeline has been stopped}]
We may need to look into whether the xdcr process should properly handle these types of revision issues. This is a one-node source cluster, and force-killing and restarting goxdcr may not be a kosher option in production.
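One possible shape for that handling, purely as a sketch under assumed names (GetRev and DelWithRev are hypothetical, not the actual metakv client API, and this is not necessarily the fix that shipped): on a revision mismatch, re-read the entry to pick up the latest revision and retry the delete a bounded number of times before giving up.

// retrysketch.go: a hedged sketch with hypothetical method names; not the
// actual metakv client API and not necessarily the shipped fix. On
// "revision number does not match", refresh the revision and retry.
package main

import (
	"errors"
	"fmt"
)

var errRevMismatch = errors.New("revision number does not match")

// revStore is a minimal stand-in for a metakv-like client.
type revStore interface {
	GetRev(key string) (int, error)
	DelWithRev(key string, rev int) error
}

// deleteWithRetry refreshes the revision and retries when the delete is
// rejected because another writer bumped the revision in the meantime.
func deleteWithRetry(s revStore, key string, maxRetries int) error {
	var err error
	for i := 0; i <= maxRetries; i++ {
		var rev int
		if rev, err = s.GetRev(key); err != nil {
			return err
		}
		if err = s.DelWithRev(key, rev); err == nil || !errors.Is(err, errRevMismatch) {
			return err
		}
		// Conflict between Get and Del: refresh and try again.
	}
	return fmt.Errorf("delete %q: gave up after %d retries: %w", key, maxRetries, err)
}

// fakeStore simulates one concurrent revision bump between Get and Del.
type fakeStore struct{ rev, conflicts int }

func (f *fakeStore) GetRev(key string) (int, error) { return f.rev, nil }

func (f *fakeStore) DelWithRev(key string, rev int) error {
	if f.conflicts > 0 {
		f.conflicts--
		f.rev++ // someone else changed the entry after our Get
		return errRevMismatch
	}
	return nil
}

func main() {
	err := deleteWithRetry(&fakeStore{rev: 7, conflicts: 1}, "replicationSpec/B1/B2", 3)
	fmt.Println("garbage collection delete:", err) // <nil> after one refresh-and-retry
}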
Issue: A legacy race condition, in which the metadata store could cause a conflict, was exposed as part of the binary filter improvements.
Resolution: Legacy race conditions have all been resolved.