Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-59669

XDCR - checkpointMgr failover detection only works if seqno progressed

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 7.6.0
    • 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.1.4, 7.0.5, 7.1.0, 7.1.1, 7.1.2, 7.2.0, 7.1.3, 7.2.1, 7.1.5, 7.2.2
    • XDCR
    • Untriaged
    • 0
    • No

    Description

      Test 2 Source side

      2023-11-10T19:50:13.418Z INFO GOXDCR.CheckpointMgr: 9a26b5281c0bb68d3565f02dedbe13e3/b1/b2 Received rollback from DCP stream vb=190, rollbackseqno=0
      

      2023-11-10T19:50:13.202655+00:00 WARNING 70: (b1) DCP (Producer) eq_dcpq:xdcr:dcp_9a26b5281c0bb68d3565f02dedbe13e3/b1/b2_127.0.0.1:11210_1:eZKvS2DDiE69D9iifDADsw== - (vb:634) Stream request requires rollback to seqno:0 because vBucket UUID not found in failover table, consumer and producer have no common history. Client requested seqnos:{1,18446744073709551615} snapshot:{0,1} uuid:270389697156690
      

      We can see that VBUUID did indeed diverged.

      2023-11-10T19:50:14.816Z INFO GOXDCR.CheckpointMgr: 9a26b5281c0bb68d3565f02dedbe13e3/b1/b2 Rolled back startSeqno to 0 for vb=36
      2023-11-10T19:50:14.816Z INFO GOXDCR.CheckpointMgr: 9a26b5281c0bb68d3565f02dedbe13e3/b1/b2 Retry vbts=[vbno=36, uuid=27833190048882, seqno=0, sn_start=0, sn_end=0, srcManId=0, tgtManId=0]
      2023-11-10T19:51:43.604Z INFO GOXDCR.StatsMgr: 9a26b5281c0bb68d3565f02dedbe13e3/b1/b2 total_docs=0, docs_processed=0, changes_left=0
      

      After the rollback, we indeed see that the source bucket has no document per Ephemeral bucket failover.
      After this point, we need to examine the target.

      On the target, we see VBUUID changed from prior to failover:

      2023-11-10T19:43:56.815Z INFO GOXDCR.CheckpointMgr: high_seqno_and_vbuuid_map=map[0:[2 154426843721660] 1:[3 210277909393215] 2:[1 84684117587385] 3:[1 8290499089518] 4:[1 196910853336884] 5:[0 199740622769747] 6:[3 119872586875387] 7:[4 222892142227865] 8:[4 95022217696648] 9:[3 39012357378059] 10:[1 211788290813736] 11:[1 34673846444505] 12:[0 228661520770186] 13:[1 166863773917663] 14:[3 137415014364077] 15:[2 77241665392177] 16:[0 39318197108351] 17:[1 51085109810178] 18:[2 254941462668812] 19:[3 216767593795803] 20:[3 212991720192003] 21:[4 153567376942254] 22:[2 270555470095135] 23:[0 249286862309245] 24:[2 134403816601229] 25:[0 124606719495893] 26:[4 224351320372821] 27:[3 225902359461913] 28:[3 62391942643225] 29:[2 244965601549711] 30:[0 206364212072488] 31:[1 237853971712258] 32:[1 41949912824453] 33:[1 59002330479586] 34:[3 140832479357498] 35:[2 223724436483744] 36:[5 27833190048882] 37:[3 81878845313773] 38:[0 249417699123833] 39:[2 187641062205983] 40:[2 202554840050771] 41:[2 157974071400220] 42:[3 102706939829461] 43:[5 253384948749306
      

      with some connectivity issues in between:

      2023-11-10T19:53:56.776Z WARN GOXDCR.CheckpointMgr: MainPipeline cae0a545b06006cf27f38e63a52f3130/b2/b1 Error getting vbucket-seqno stats for serverAddr=172.31.45.240:11210. vbnos=[0 1 2 3 4 …1022 1023], err=write tcp 172.31.44.32:37796->172.31.45.240:11210: write: broken pipe
      

      to post failover:

      2023-11-10T19:53:57.020Z INFO GOXDCR.CheckpointMgr: high_seqno_and_vbuuid_map=map[0:[0 126957815739762] 1:[0 73656497180216] 2:[0 210882170864119] 3:[0 227669496888839] 4:[0 234892281058358] 5:[0 77855964368367] 6:[0 94592576861185] 7:[0 97404996130780] 8:[0 202473979135397] 9:[0 137334799062834] 10:[0 49715392751982] 11:[0 77039012834310] 12:[0 143315529511933] 13:[0 256444867542785] 14:[0 98071247603780] 15:[0 118524803845585] 16:[0 54603363440149] 17:[0 99076048686569] 18:[0 197290223413343] 19:[0 252515855428545] 20:[0 165203606207565]
      

      The odd thing is that this should have triggered a restart but it did not.

      Diving in deeper by comparing the logs between test 1 and test 2, I realized that the timing of the situation is different. Test 1 has the situation where checkpoint manager has not done any checkpointing at all, then the failover occurs on the target. For test 2, at least one successful checkpoint operation has occurred, then the failover happens on the target. That’s the difference I can see.

      This is most likely due to the fact that no mutation was written, meaning that the through sequence number did not move. So, XDCR checkpoint manager will skip the actual checkpointing operation. When it skips the checkpointing, it also skips the VBUUID check verification. I suspect that if one more mutation was written, it would cause VBUUID detection to occur.

      I was able to write a test and verify this theory and reproduce the issue, such that Checkpoint Manager was aware of the target vbuuid change but did not restart the pipeline.
      By writing one more document, we got the actual restart:

      2023-11-15T13:43:34.277-08:00 INFO GOXDCR.CheckpointMgr: pipelineFullTopic=5bc716dfb2b79f1429bf75a391164885/B1/B2, Errors encountered in checkpointing: map[406:target vbuuid has changed]
      2023-11-15T13:43:34.418-08:00 INFO GOXDCR.PipelineMgr: Replication Status=name={5bc716dfb2b79f1429bf75a391164885/B1/B2}, status={Pending}, errors={[{"time":"2023-11-15T13:43:34.118729-08:00","errMsg":"TopoChangeDet : target vbuuid has changed"}]}, oldProgress={Source nozzles have been closed}, progress={Pipeline has been stopped}
      

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              ayush.nayyar Ayush Nayyar
              neil.huang Neil Huang
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty