Details
- Bug
- Resolution: Duplicate
- Critical
- None
- 7.0.3
- None
- Untriaged
- 1
- Unknown
Description
In a recent case from the field we saw rebalance fail for the following reason:
2022-04-19T16:34:51.847-07:00, [ns_orchestrator:0:critical:message(ns_1@172.23.104.203) - Rebalance exited with reason {service_rebalance_failed,backup,
    {agent_died,<31216.25955.386>,
        {linked_process_died,<31216.17073.386>,
            {'ns_1@172.23.120.109',
                {{badmatch,
                    {false,
                        {topology,[],
                            [<<"8dccfe5b6d428dd31a33c901f3b3bfc3">>],
                            false,[]},
                        {topology,[],
                            [<<"20aab8db8a8400f79260bbc031acb796">>,
                             <<"8dccfe5b6d428dd31a33c901f3b3bfc3">>],
                            true,[]}}},
                [{service_agent,long_poll_worker_loop,5,
                    [{file,"src/service_agent.erl"},
                     {line,654}]},
                 {proc_lib,init_p,3,
                    [{file,"proc_lib.erl"},{line,234}]}]}}}}}.
This occurs at this point in the code: https://github.com/couchbase/ns_server/blob/cheshire-cat/src/service_agent.erl#L654. What's happening is that ns_server queries the service for the state of its topology, and the service is required to always return the same topology for a given revision: if the revision is unchanged, the topology must be unchanged, and ns_server asserts that this holds. The failure of this assertion is a problem because it causes the rebalance to fail, and there is no good way to recover from it.
Note that the failure occurred at the end of a rebalance in which one of the Backup service nodes (20aab8db8a8400f79260bbc031acb796, in fact) was being rebalanced out. Backup returned the updated topology containing only the node 8dccfe5b6d428dd31a33c901f3b3bfc3, but with the same revision as its previous response to ns_server, which appears to be the cause of the problem.
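The contract ns_server enforces can be modeled simply: if a service reports the same topology revision twice, the topology contents must be identical. Below is a minimal Python sketch of that invariant, assuming hypothetical names (check_topology, TopologyMismatch, and the revision value 42 are illustrative; the real check is the badmatch in service_agent.erl). The node IDs are taken from the log above.

```python
class TopologyMismatch(Exception):
    """Raised when a service reuses a revision for a different topology."""


def check_topology(prev, new):
    """Assert the invariant ns_server relies on: same revision implies
    identical topology contents. Returns the new topology if consistent."""
    if prev["rev"] == new["rev"] and prev["nodes"] != new["nodes"]:
        # This mirrors the badmatch in service_agent.erl line 654: the
        # revision was reused but the node list changed, so the rebalance
        # fails with no good way to recover.
        raise TopologyMismatch(
            f"revision {new['rev']!r} reused for differing topologies: "
            f"{prev['nodes']} vs {new['nodes']}")
    return new


# Before: both Backup service nodes are present (revision 42 is made up).
prev = {"rev": 42,
        "nodes": ["20aab8db8a8400f79260bbc031acb796",
                  "8dccfe5b6d428dd31a33c901f3b3bfc3"]}

# After one node is rebalanced out, Backup returned the shrunken node list
# but kept the same revision, which is what triggered the failure here.
new = {"rev": 42,
       "nodes": ["8dccfe5b6d428dd31a33c901f3b3bfc3"]}

try:
    check_topology(prev, new)
except TopologyMismatch as err:
    print("rebalance would fail:", err)
```

The correct behavior for the service would be to bump the revision whenever the topology changes, so that ns_server can distinguish a stale poll response from a genuinely inconsistent one.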
Attachments
Issue Links
- duplicates MB-50839: Backup: Service rebalance failed with reason "service_rebalance_failed" during service_agent,long_poll_worker_loop (Closed)