Couchbase Server / MB-61507

Disconnected DCP connection not brought up until rebalance completes.


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: Morpheus
    • Affects Version/s: None
    • Component/s: ns_server
    • Labels: None
    • Triage: Untriaged
    • Story Points: 0
    • Is this a Regression?: Unknown

    Description

      Observed the following in the KV logs from a recent case, CBSE-16809.

      • The customer hit the issue seen in MB-60066, wherein the DCP producer connection for vb:304 and vb:315 (bucket: care_persist) got closed.

       
      2024-04-10T05:29:14.277985+00:00 ERROR 305: (care_persist) DCP (Producer) eq_dcpq:replication:ns_1@cb-01-0080.cb-01.prod1-backingservices.svc->ns_1@cb-01-0096.cb-01.prod1-backingservices.svc:care_persist - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":4,"opcode":"DCP_DELETION","status":"Too large"} for stream:stream name:eq_dcpq:replication:ns_1@cb-01-0080.cb-01.prod1-backingservices.svc->ns_1@cb-01-0096.cb-01.prod1-backingservices.svc:care_persist, vb:315, state:in-memory
      
      

      • The producer connection was brought back up by ns_server ~8 minutes later.

       
      2024-04-10T05:37:52.414867+00:00 INFO 2099: (care_persist) DCP (Producer) eq_dcpq:replication:ns_1@cb-01-0080.cb-01.prod1-backingservices.svc->ns_1@cb-01-0096.cb-01.prod1-backingservices.svc:care_persist - (vb:304) Creating stream with start seqno 107511159 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, snapshot:{107511159,107511159} collections-filter-size:0 sid:none
      
      

      • SyncWrites were failing in this ~8 minute window, as majority durability could not be satisfied for those vbuckets while their replication stream was down.

       
      2024-04-10T05:30:44.824035+00:00 WARNING 1887: Slow operation: {"bucket":"care_persist","cid":"8481760300000001/00000000C940F6FB/1123192","command":"ADD","duration":"54 s","packet":{"bodylen":3906,"
      ...
      2024-04-10T05:37:12.159847+00:00 WARNING 1564: Slow operation: {"bucket":"care_persist","cid":"0944709100000001/0000000018A5C9B4/16612a4","command":"ADD","duration":"54 s","packet":{"bodylen":6314,"
      
      

      • On checking why ns_server didn't bring the producer connection back up immediately, I noticed there was a rebalance running in the above ~8 minute window. That rebalance eventually failed, and the replication streams were only brought back up after it completed (in a janitor run).

      I would have presumed the supervisor for the dcp_replicator process in ns_server would have brought the producer connection back up immediately after the initial disconnect, since the Erlang process would have crashed with a non-normal exit.

       
      [error_logger:error,2024-04-09T05:37:49.777Z,ns_1@cb-01-0080.cb-01.prod1-backingservices.svc:dcp_sup-care_persist<0.18185.7>:ale_error_logger_handler:do_log:101]
      =========================SUPERVISOR REPORT=========================
          supervisor: {local,'dcp_sup-care_persist'}
          errorContext: child_terminated
          reason: socket_closed
          offender: [{pid,<0.23325.2967>},
                     {id,{'ns_1@cb-01-0090.cb-01.prod1-backingservices.svc',
                             [collections,del_times,del_user_xattr,json,
                              set_consumer_name,snappy,xattr]}},
                     {mfargs,{dcp_replicator,start_link,undefined}},
                     {restart_type,temporary},
                     {significant,false},
                     {shutdown,60000},
                     {child_type,worker}]
      
      

      Ben Huddleston looked at the ns_server code and pointed out that the restart policy for the dcp_replicator process is temporary.
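
      For context, here is a minimal OTP child-spec sketch of that distinction (illustrative only, not the actual dcp_sup/dcp_replicator code; the start arguments are made up, since the supervisor report above shows them scrubbed to undefined). A temporary child is never restarted by its supervisor regardless of how it exits, so the socket_closed crash simply removes it from the tree, whereas a transient child would be restarted on any abnormal exit.

      -module(dcp_restart_sketch).
      -export([temporary_spec/2, transient_spec/2]).

      %% Illustrative only. With restart => temporary the supervisor never
      %% restarts the child, whatever the exit reason, so a socket_closed
      %% crash just removes it from the supervision tree.
      temporary_spec(ProducerNode, Features) ->
          #{id       => {ProducerNode, Features},
            start    => {dcp_replicator, start_link, [ProducerNode, Features]},
            restart  => temporary,
            shutdown => 60000,
            type     => worker}.

      %% With restart => transient the supervisor would restart the child on
      %% any abnormal exit (socket_closed included) but not on a normal stop.
      transient_spec(ProducerNode, Features) ->
          #{id       => {ProducerNode, Features},
            start    => {dcp_replicator, start_link, [ProducerNode, Features]},
            restart  => transient,
            shutdown => 60000,
            type     => worker}.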

      I think we should bring these connections back up immediately (based on the most up-to-date local view of the topology on the consumer node, where dcp_proxy runs). I don't know, though, whether there is some history behind why the restart policy for these replicator processes is temporary.
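
      For what it's worth, a rough sketch of the kind of immediate reconciliation I have in mind (purely hypothetical; desired_replications/1 and running_replications/1 are stand-ins for however the consumer node derives its local view of the topology and the replicators currently alive under dcp_sup): compare what should exist with what is running and start anything missing, rather than waiting for the post-rebalance janitor run.

      -module(dcp_reconcile_sketch).
      -export([ensure_replications/1]).

      %% Hypothetical sketch only. Compares the replications the local
      %% topology view says should exist with the dcp_replicator children
      %% actually running, and starts whatever is missing.
      ensure_replications(Bucket) ->
          Desired = desired_replications(Bucket),   %% [{ProducerNode, Features}]
          Running = running_replications(Bucket),   %% replicators currently alive
          Missing = Desired -- Running,
          [start_replicator(Bucket, Producer, Features)
           || {Producer, Features} <- Missing],
          ok.

      start_replicator(Bucket, Producer, Features) ->
          supervisor:start_child(
            dcp_sup_name(Bucket),
            #{id       => {Producer, Features},
              start    => {dcp_replicator, start_link, [Producer, Features]},
              restart  => temporary,
              shutdown => 60000,
              type     => worker}).

      dcp_sup_name(Bucket) ->
          list_to_atom("dcp_sup-" ++ Bucket).

      %% Placeholders; a real implementation would query the local bucket
      %% topology and supervisor:which_children/1 respectively.
      desired_replications(_Bucket) -> [].
      running_replications(_Bucket) -> [].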

      Logs:

      There was no separate CBSE opened for this issue, as it was being triaged while the issue was happening in the customer's cluster, but the relevant logs and discussion for the SyncWrite failures are here:

       https://couchbase.slack.com/archives/C06T4RDAMCJ/p1712771184069949

      http://supportal.couchbase.com/snapshot/19bd2efe90e6e07d79eb8cb5ab4bdeed::109

      Jim Walker noticed the same issue in CBSE-16848 as well.

      People

        Assignee: Abhijeeth Nuthan
        Reporter: Hareen Kancharla
        Votes: 0
        Watchers: 5
