Details
- Bug
- Resolution: Fixed
- Critical
- 7.0.3, 7.0.4
- Untriaged
- 1
- Unknown
Description
As seen in a recent incident in the field, the Backup Service can become stuck. When that happens, ns_server drops the connection after a timeout and waits for the Backup Service to reconnect, which never happens. The JSON RPC connection process is restarted, times out again, and this cycle continues for many hours.
ns_server logs a message like this:
[ns_server:error,2022-04-19T17:02:50.904-07:00,ns_1@172.23.120.100:service_agent-backup<0.24218.3688>:service_agent:handle_info:277]Linked process <0.21373.3688> died with reason {no_connection,
"backup-service_api"}. Terminating
This repeats about 60 times an hour (roughly once a minute) for many hours:
$ grep "service_agent.*no_connection" ns_server.debug.log | grep -E -o 2022-..-..T.. | uniq -c
58 2022-04-19T17
60 2022-04-19T18
60 2022-04-19T19
60 2022-04-19T20
60 2022-04-19T21
60 2022-04-19T22
60 2022-04-19T23
60 2022-04-20T00
60 2022-04-20T01
60 2022-04-20T02
60 2022-04-20T03
60 2022-04-20T04
60 2022-04-20T05
60 2022-04-20T06
60 2022-04-20T07
60 2022-04-20T08
60 2022-04-20T09
60 2022-04-20T10
1 2022-04-20T11
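For anyone who wants the same per-hour breakdown without the grep pipeline, here is a minimal Go sketch. It assumes the log is piped in on stdin and that timestamps have the same shape as in the message above. The names `hourBucket`, `matchRe` and `hourRe` are made up for this example and are not part of ns_server or cbauth.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

var (
	// Same filter as the first grep in the pipeline.
	matchRe = regexp.MustCompile(`service_agent.*no_connection`)
	// Date plus hour, e.g. 2022-04-19T17 (generalised from "2022-..-..T..").
	hourRe = regexp.MustCompile(`\d{4}-\d{2}-\d{2}T\d{2}`)
)

// hourBucket returns the date-and-hour prefix of a matching log line.
// ok is false when the line is not a service_agent no_connection message
// or carries no timestamp.
func hourBucket(line string) (hour string, ok bool) {
	if !matchRe.MatchString(line) {
		return "", false
	}
	hour = hourRe.FindString(line)
	return hour, hour != ""
}

func main() {
	counts := make(map[string]int)
	var order []string // hour buckets in first-seen order

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // tolerate long log lines
	for sc.Scan() {
		hour, ok := hourBucket(sc.Text())
		if !ok {
			continue
		}
		if _, seen := counts[hour]; !seen {
			order = append(order, hour)
		}
		counts[hour]++
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
	for _, hour := range order {
		fmt.Printf("%4d %s\n", counts[hour], hour)
	}
}
```

Unlike `uniq -c`, which counts runs of adjacent identical lines, this groups by hour across the whole input. The two agree when the log is in time order.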
The code that re-establishes the connection with ns_server is here: https://github.com/couchbase/cbauth/blob/94cdd4fa943bb2107f48238bb563e5bc71b73df5/revrpc/revrpc.go#L282. For a reason not yet identified, however, once the connection drops, something prevents the reconnection from happening.
Issue Links
- relates to: MB-50839 Backup: Service rebalance failed with reason "service_rebalance_failed" during service_agent,long_poll_worker_loop (Closed)