Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-53081

DCP Consumers can successfully StreamRequest from KV node in warmup which is about-to-be failed over



    • Bug
    • Resolution: Unresolved
    • Major
    • Morpheus
    • 6.6.2
    • couchbase-bucket
    • None
    • Untriaged
    • 1
    • Unknown


      This MB is spun out from MB-53076, when the following sequence showed that Indexing managed to reconnect to a KV node and perform DCP StreamRequests before the node had trafficEnabled by ns_server - and crucially a few seconds before the KV node was failed over and vBucket promoted elsewhere:

      a. At 2022-07-21T17:39:12.120-07:00, memcached exited

      2022-07-21T17:39:12.120-07:00, ns_log:0:info:message(ns_1@ - Service 'memcached' exited with status 137. Restarting. Messages:

      b. Due to memcached exit, indexer received a streamEnd. Considering vb:141 for our analysis, the last received seqno. is 4168

      2022-07-21T17:39:12.111-07:00 [Info] TK StreamEnd MAINT_STREAM test3 141 77409272469274 4168

      c. Indexer triggered stream repair due to stream end. For vb:141, indexer restarted the stream with last received seqno. 4168 and a snapshot of {4163, 4168}

      2022-07-21T17:39:16.284-07:00 [Info] Indexer::handleKVStreamRepair Initiate Stream Repair MAINT_STREAM Bucket test3 StreamId 1
      2022-07-21T17:39:16.284-07:00 [Info] Indexer::startBucketStream Stream: MAINT_STREAM Bucket: test3 SessionId 1 RestartTS bucket: test3, vbuckets: 1024 Crc64: 0 snapType NO_SNAP -
          {vbno, vbuuid, seqno, snapshot-start, snapshot-end}
          {  141     46674030171a       4168       4163       4168}

      d.For this stream request, memcached (on node 100.15) asked projector to rollback to 4109 

      2022-07-21T17:39:17.706652-07:00 WARNING 382: (test3) DCP (Producer) eq_dcpq:secidx:proj-test3-MAINT_STREAM_TOPIC_0765f2d280c663c382dea1a1b49ab1cd-2556451229124417097/1 - (vb:141) Stream request requires rollback to seqno:4109 because consumer ahead of producer - producer upper at 4109. Client requested seqnos:{4168,18446744073709551615} snapshot:{4163,4168} uuid:77409272469274

      e. After this, the KV node 100.15 failed over

      2022-07-21T17:39:18.234-07:00, failover:0:info:message(ns_1@ - Starting failing over ['ns_1@'] 
      2022-07-21T17:39:19.308-07:00, ns_orchestrator:0:info:message(ns_1@ - Failover completed successfully.

      ... Indexer goes on to connect to new node holding vb:141, issues StreamRequest, is told to rollback again ...

      The problem here is that Indexing has had to perform an extra rollback / repair based on VBucket UUID & sequence numbers on the "just crashed" active node, before ns_server had decided this node is healthy and enabled traffic on it.

      I believe this is a long-standing behaviour in KV - from the comments in the code:

       * KV-engine has the following behaviour as warmup runs.
       * Whilst the following phases are incomplete:
       *    Initialise
       *    CreateVBuckets
       *    LoadingCollectionCounts
       *    EstimateDatabaseItemCount
       *    LoadPreparedSyncWrites
       *    PopulateVBucketMap
       *  1) setVBucket requests are queued (using the EWOULDBLOCK mechanism)
       *  2) The VBucket map is empty. No operation can find a VBucket object to
       *     operate on. For CRUD operations externally clients will see
       *     'temporary_failure' instead of 'not_my_vbucket'.
       *  3) DCP Consumers cannot be created.
       *  4) DCP Producers can be created, but stream-request will fail with
       *     'not_my_vbucket'.
       * On completion of PopulateVBucketMap:
       *  1) All queued setVBucket requests are notified and all new setVBucket
       *     requests can be processed.
       *  2) The VBucket map is populated and operations can now find VBucket objects.
       *     * DCP producer stream-requests can now be processed.

      Once PopulateVBucketMap step has completed, DCP StreamRequests can be (successfully) processed.

      While this is desirable in terms of bringing replicas / other DCP consumers back online and consistent asap, it does have the downside seen the the above scenario - a DCP consumer could end up having to discard data / perform costly updates of it's own state based on information from a node which is not "ready" from ns_server's POV and is soon-to-be failed over.




        No reviews matched the request. Check your Options in the drop-down menu of this sections header.



            owend Daniel Owen
            drigby Dave Rigby (Inactive)
            0 Vote for this issue
            6 Start watching this issue



              Gerrit Reviews

                There are no open Gerrit changes