Description
Observed the following from the KV logs in a recent case, CBSE-16809.
- The customer hit the issue seen in MB-60066, wherein the DCP producer connection for vb:304 and vb:315 (bucket: care_persist) got closed.
2024-04-10T05:29:14.277985+00:00 ERROR 305: (care_persist) DCP (Producer) eq_dcpq:replication:ns_1@cb-01-0080.cb-01.prod1-backingservices.svc->ns_1@cb-01-0096.cb-01.prod1-backingservices.svc:care_persist - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":4,"opcode":"DCP_DELETION","status":"Too large"} for stream:stream name:eq_dcpq:replication:ns_1@cb-01-0080.cb-01.prod1-backingservices.svc->ns_1@cb-01-0096.cb-01.prod1-backingservices.svc:care_persist, vb:315, state:in-memory
- The producer connection was brought back by ns_server ~8 mins later.
2024-04-10T05:37:52.414867+00:00 INFO 2099: (care_persist) DCP (Producer) eq_dcpq:replication:ns_1@cb-01-0080.cb-01.prod1-backingservices.svc->ns_1@cb-01-0096.cb-01.prod1-backingservices.svc:care_persist - (vb:304) Creating stream with start seqno 107511159 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, snapshot:{107511159,107511159} collections-filter-size:0 sid:none
- syncWrites were failing during this ~8 minute window.
2024-04-10T05:30:44.824035+00:00 WARNING 1887: Slow operation: {"bucket":"care_persist","cid":"8481760300000001/00000000C940F6FB/1123192","command":"ADD","duration":"54 s","packet":{"bodylen":3906,"

...

2024-04-10T05:37:12.159847+00:00 WARNING 1564: Slow operation: {"bucket":"care_persist","cid":"0944709100000001/0000000018A5C9B4/16612a4","command":"ADD","duration":"54 s","packet":{"bodylen":6314,"
- On checking why ns_server didn't bring the producer connection back up immediately, I noticed there was a rebalance running in the above 8 minute window which eventually failed, and the replication streams were only brought back up after that rebalance finished (in a janitor run).
I would have presumed the supervisor for the dcp_replicator process in ns_server would have brought the producer connection back up immediately after the initial disconnect, since the Erlang process would have crashed with a non-normal exit.
[error_logger:error,2024-04-09T05:37:49.777Z,ns_1@cb-01-0080.cb-01.prod1-backingservices.svc:dcp_sup-care_persist<0.18185.7>:ale_error_logger_handler:do_log:101]
=========================SUPERVISOR REPORT=========================
    supervisor: {local,'dcp_sup-care_persist'}
    errorContext: child_terminated
    reason: socket_closed
    offender: [{pid,<0.23325.2967>},
               {id,{'ns_1@cb-01-0090.cb-01.prod1-backingservices.svc',
                    [collections,del_times,del_user_xattr,json,
                     set_consumer_name,snappy,xattr]}},
               {mfargs,{dcp_replicator,start_link,undefined}},
               {restart_type,temporary},
               {significant,false},
               {shutdown,60000},
               {child_type,worker}]
Ben Huddleston looked at the ns_server code and pointed out that the restart policy for the dcp_replicator process is temporary.
I think we should bring these connections back up immediately (based on the most up-to-date local view of the topology on the consumer node where dcp_proxy runs). I don't know, though, whether there is some history behind why the restart policy is temporary for these replicator processes.
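For context on the restart policy: in OTP, a child registered as temporary is never restarted by its supervisor, regardless of exit reason; a transient child is restarted only after a non-normal exit (such as the socket_closed above); a permanent child is always restarted. A minimal, hypothetical sketch of what a dcp_replicator child spec with a transient restart type could look like is below - the id shape and start args are assumptions for illustration, and the actual child spec in ns_server will differ:

%% Hypothetical sketch only: a dcp_replicator child spec using 'transient'
%% instead of 'temporary', so the supervisor would restart the replicator
%% after a non-normal exit (e.g. socket_closed) rather than leaving it down
%% until the next janitor run. Id/args below are illustrative, not ns_server's.
child_spec(ConsumerNode, Features) ->
    #{id       => {ConsumerNode, Features},
      start    => {dcp_replicator, start_link, [ConsumerNode, Features]},
      restart  => transient,   %% currently 'temporary' per the supervisor report above
      shutdown => 60000,
      type     => worker}.

Whether a supervisor-driven restart is safe here (versus leaving it to the janitor, which re-derives the desired replication topology) is exactly the question about the history of the temporary policy raised above.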
Logs:
There was no separate CBSE opened for this issue; it was being triaged live while the issue was happening in the customer's cluster. The relevant logs and discussion for the syncWrite failures are here:
https://couchbase.slack.com/archives/C06T4RDAMCJ/p1712771184069949
http://supportal.couchbase.com/snapshot/19bd2efe90e6e07d79eb8cb5ab4bdeed::109
Jim Walker noticed the same issue in CBSE-16848 as well.