Description
Observed the following from the KV logs in a recent case, CBSE-16809.
- The customer hit the issue seen in MB-60066, wherein the DCP producer connection for vb:304 and vb:315 (bucket: care_persist) got closed.
2024-04-10T05:29:14.277985+00:00 ERROR 305: (care_persist) DCP (Producer) eq_dcpq:replication:ns_1@cb-01-0080.cb-01.prod1-backingservices.svc->ns_1@cb-01-0096.cb-01.prod1-backingservices.svc:care_persist - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":4,"opcode":"DCP_DELETION","status":"Too large"} for stream:stream name:eq_dcpq:replication:ns_1@cb-01-0080.cb-01.prod1-backingservices.svc->ns_1@cb-01-0096.cb-01.prod1-backingservices.svc:care_persist, vb:315, state:in-memory
- The producer connection was brought back by ns_server ~8 mins later.
2024-04-10T05:37:52.414867+00:00 INFO 2099: (care_persist) DCP (Producer) eq_dcpq:replication:ns_1@cb-01-0080.cb-01.prod1-backingservices.svc->ns_1@cb-01-0096.cb-01.prod1-backingservices.svc:care_persist - (vb:304) Creating stream with start seqno 107511159 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, snapshot:{107511159,107511159} collections-filter-size:0 sid:none
- syncWrites were failing during this ~8 minute window.
2024-04-10T05:30:44.824035+00:00 WARNING 1887: Slow operation: {"bucket":"care_persist","cid":"8481760300000001/00000000C940F6FB/1123192","command":"ADD","duration":"54 s","packet":{"bodylen":3906,"

...

2024-04-10T05:37:12.159847+00:00 WARNING 1564: Slow operation: {"bucket":"care_persist","cid":"0944709100000001/0000000018A5C9B4/16612a4","command":"ADD","duration":"54 s","packet":{"bodylen":6314,"
- On checking why ns_server didn't bring the producer connection back up immediately, I noticed there was a rebalance running in the above 8 minute window which eventually failed, and the replication streams were only brought back up after that rebalance finished (in a janitor run).
I would have presumed the supervisor for the dcp_replicator process in ns_server would have brought the producer connection back up immediately after the initial disconnect, since the Erlang process would have crashed with a non-normal exit.
[error_logger:error,2024-04-09T05:37:49.777Z,ns_1@cb-01-0080.cb-01.prod1-backingservices.svc:dcp_sup-care_persist<0.18185.7>:ale_error_logger_handler:do_log:101]
=========================SUPERVISOR REPORT=========================
    supervisor: {local,'dcp_sup-care_persist'}
    errorContext: child_terminated
    reason: socket_closed
    offender: [{pid,<0.23325.2967>},
               {id,{'ns_1@cb-01-0090.cb-01.prod1-backingservices.svc',
                    [collections,del_times,del_user_xattr,json,
                     set_consumer_name,snappy,xattr]}},
               {mfargs,{dcp_replicator,start_link,undefined}},
               {restart_type,temporary},
               {significant,false},
               {shutdown,60000},
               {child_type,worker}]
Ben Huddleston looked at the ns_server code and pointed out that the restart policy for the dcp_replicator process is temporary.
I think we should bring these connections back up immediately (based on the most up-to-date local view of the topology on the consumer node where dcp_proxy runs). I don't know, though, whether there is some history behind why the restart policy is temporary for these replicator processes.
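For context on the restart policy: in OTP, a child registered as temporary is never restarted by its supervisor, regardless of exit reason; a transient child is restarted only after a non-normal exit (such as the socket_closed above); a permanent child is always restarted. A minimal, hypothetical sketch of what a dcp_replicator child spec with a transient restart type could look like is below - the id shape and start args are assumptions for illustration, and the actual child spec in ns_server will differ:

%% Hypothetical sketch only: a dcp_replicator child spec using 'transient'
%% instead of 'temporary', so the supervisor would restart the replicator
%% after a non-normal exit (e.g. socket_closed) rather than leaving it down
%% until the next janitor run. Id/args below are illustrative, not ns_server's.
child_spec(ConsumerNode, Features) ->
    #{id       => {ConsumerNode, Features},
      start    => {dcp_replicator, start_link, [ConsumerNode, Features]},
      restart  => transient,   %% currently 'temporary' per the supervisor report above
      shutdown => 60000,
      type     => worker}.

Whether a supervisor-driven restart is safe here (versus leaving it to the janitor, which re-derives the desired replication topology) is exactly the question about the history of the temporary policy raised above.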
Logs:
There was no separate CBSE opened for this issue; it was being triaged live while the issue was happening in the customer's cluster. The relevant logs and discussion for the syncWrite failures are here:
https://couchbase.slack.com/archives/C06T4RDAMCJ/p1712771184069949
http://supportal.couchbase.com/snapshot/19bd2efe90e6e07d79eb8cb5ab4bdeed::109
Jim Walker noticed the same issue in CBSE-16848 as well.