  1. Couchbase Server
  2. MB-57788

[BP 7.1.5] - XDCR - metakv "revision does not match" shows up

Details

    • Bug
    • Resolution: Fixed
    • Test Blocker
    • 7.1.5
    • 7.6.0, 7.2.1, 7.1.5
    • XDCR
    • Untriaged
    • 0
    • Yes

    Description

      Executive Summary:

      1. goxdcr's handling of metakv callbacks has been racy since inception.
      2. goxdcr has a hidden race condition between the replication settings upgrade path and an actual user-initiated replication settings change issued through the REST API. The upgrade path was introduced in early XDCR (around 5.0, IIRC).
      3. As part of the binary filter improvements (MB-56739), an unnecessary replication setting key was added to the internal data structure, causing the upgrade path (#2 above) to be triggered every time a replication is changed (pause, resume, or any other setting change). This exposes the race condition in #2 and, as a result, also exposes #1.
      4. Fixing all three (1, 2, and 3) is necessary to fully correct the situation.

      Original description
      Update 7/6: We haven't seen this one before in goxdcr tests. It seems that ns_server's make simple test is also showing this failure. The responsible component is unknown. Conversation: https://couchbase.slack.com/archives/CC6NF8ERY/p1688692635407479

      This was found during a regular run of the developer's collections test suite, with a 1-node source cluster and a 1-node target cluster.

      Due to an unknown race, the metakv object's revision ID can get out of sync, leading to the "revision number does not match" error. This is surprising, because it is a single-node cluster and there should be no other writers or readers of the data in metakv.

      2023-07-05T22:53:34.344-07:00 INFO GOXDCR.AdminPort: doChangeReplicationSettingsRequest
      2023-07-05T22:53:34.344-07:00 INFO GOXDCR.AdminPort: Request params: replicationId=bc959726b1e9e8426810fceb3d6e9a2f/B1/B2 justValidate=false includeWarnings=false
      2023-07-05T22:53:34.344-07:00 INFO GOXDCR.AdminPort: Request params: justValidate=false includeWarnings=false inputSettings=map[colMappingRules:map[S1:S1]]
      2023-07-05T22:53:34.344-07:00 INFO GOXDCR.ReplMgr: Update replication settings for bc959726b1e9e8426810fceb3d6e9a2f/B1/B2, settings=map[colMappingRules:map[S1:S1]], justValidate=false
      2023-07-05T22:53:34.344-07:00 INFO GOXDCR.ReplSpecSvc: Successfully retrieved target cluster reference 0x1005ec4e0. time taken=2.56µs
      2023-07-05T22:53:34.346-07:00 ERRO GOXDCR.HttpServer: Internal error in adminport, revision number does not match
      2023-07-05T22:53:35.424-07:00 INFO GOXDCR.AdminPort: doChangeReplicationSettingsRequest
      2023-07-05T22:53:35.424-07:00 INFO GOXDCR.AdminPort: Request params: replicationId=bc959726b1e9e8426810fceb3d6e9a2f/B1/B2 justValidate=false includeWarnings=false
      2023-07-05T22:53:35.424-07:00 INFO GOXDCR.AdminPort: Request params: justValidate=false includeWarnings=false inputSettings=map[active:true]
      2023-07-05T22:53:35.424-07:00 INFO GOXDCR.ReplMgr: Update replication settings for bc959726b1e9e8426810fceb3d6e9a2f/B1/B2, settings=map[active:true], justValidate=false
      2023-07-05T22:53:35.424-07:00 INFO GOXDCR.ReplSpecSvc: Successfully retrieved target cluster reference 0x1005ec4e0. time taken=2.979µs
      2023-07-05T22:53:35.425-07:00 ERRO GOXDCR.HttpServer: Internal error in adminport, revision number does not match
      

      From this point on, the in-memory copy of the replication cannot overwrite the metakv entry, and any attempt to change or delete the replication results in an error.

      Deleting the source bucket should have caused garbage collection of the replication spec to take place, but even that fails:

      2023-07-05T22:56:44.358-07:00 WARN GOXDCR.ReplSpecSvc: Error validating replication specification bc959726b1e9e8426810fceb3d6e9a2f/B1/B2. error=Bucket B1 UUID has changed from ed2eb7693b679fb7fb66b69940b1ca5c to 7de06604fe0e9ab31fd501e0d7bb1ab2, indicating a bucket deletion and recreation
      2023-07-05T22:56:44.358-07:00 ERRO GOXDCR.ReplSpecSvc: Replication specification bc959726b1e9e8426810fceb3d6e9a2f/B1/B2 is no longer valid, garbage collect it. error=Bucket B1 UUID has changed from ed2eb7693b679fb7fb66b69940b1ca5c to 7de06604fe0e9ab31fd501e0d7bb1ab2, indicating a bucket deletion and recreation
      2023-07-05T22:56:44.360-07:00 ERRO GOXDCR.ReplSpecSvc: Failed to delete replication spec, key=replicationSpec/bc959726b1e9e8426810fceb3d6e9a2f/B1/B2, err=revision number does not match
      2023-07-05T22:56:44.360-07:00 INFO GOXDCR.ReplSpecSvc: Failed to garbage collect spec bc959726b1e9e8426810fceb3d6e9a2f/B1/B2, err=revision number does not match
      2023-07-05T22:56:44.360-07:00 INFO GOXDCR.PipelineMgr: Replication Status = map[bc959726b1e9e8426810fceb3d6e9a2f/B1/B2:name={bc959726b1e9e8426810fceb3d6e9a2f/B1/B2}, status={Paused}, errors={[]}, oldProgress={Source nozzles have been closed}, progress={Pipeline has been stopped}, oldBackfillProgress={Source nozzles have been closed}, backfillProgress={Pipeline has been stopped}]
      

      We may need to look into whether the goxdcr process should handle these revision mismatches gracefully. This is a one-node source cluster, and restarting or force-killing goxdcr may not be an acceptable option in production.

       

      Issue Resolution
      A legacy race condition in which concurrent updates to the metadata store could conflict was exposed by the binary filter improvements. The legacy race conditions have all been resolved.

      Attachments

        1. image-2023-07-13-20-06-46-851.png
          313 kB
        2. mismatch.log
          19.18 MB
        3. results0.zip
          33.58 MB

            People

              ayush.nayyar Ayush Nayyar
              neil.huang Neil Huang