Details
- Bug
- Resolution: Duplicate
- Critical
- None
- 7.0.3
- None
- Untriaged
- 1
- Unknown
Description
In a recent case from the field we saw rebalance fail for the following reason:
2022-04-19T16:34:51.847-07:00, [ns_orchestrator:0:critical:message(ns_1@172.23.104.203) - Rebalance exited with reason {service_rebalance_failed,backup,
    {agent_died,<31216.25955.386>,
        {linked_process_died,<31216.17073.386>,
            {'ns_1@172.23.120.109',
                {{badmatch,
                    {false,
                        {topology,[],
                            [<<"8dccfe5b6d428dd31a33c901f3b3bfc3">>],
                            false,[]},
                        {topology,[],
                            [<<"20aab8db8a8400f79260bbc031acb796">>,
                             <<"8dccfe5b6d428dd31a33c901f3b3bfc3">>],
                            true,[]}}},
                [{service_agent,long_poll_worker_loop,5,
                    [{file,"src/service_agent.erl"},
                     {line,654}]},
                 {proc_lib,init_p,3,
                    [{file,"proc_lib.erl"},{line,234}]}]}}}}}.
This occurs at this point in the code: https://github.com/couchbase/ns_server/blob/cheshire-cat/src/service_agent.erl#L654. What's happening is that ns_server queries the service for the state of its topology, and the service is required to always return the same topology for a given revision: if the revision is unchanged, the topology must be unchanged, and ns_server asserts that this holds. The failure of this assertion is a problem because it causes the rebalance to fail, and there is no good way to recover from it.
Note that the failure occurred at the end of a rebalance in which one of the Backup service nodes (20aab8db8a8400f79260bbc031acb796, in fact) was being rebalanced out. Backup returned the updated topology containing only the node 8dccfe5b6d428dd31a33c901f3b3bfc3, but with the same revision as its previous response to ns_server, which appears to be the cause of the problem.
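The contract ns_server enforces can be modeled simply: if a service reports the same topology revision twice, the topology contents must be identical. Below is a minimal Python sketch of that invariant, assuming hypothetical names (check_topology, TopologyMismatch, and the revision value 42 are illustrative; the real check is the badmatch in service_agent.erl). The node IDs are taken from the log above.

```python
class TopologyMismatch(Exception):
    """Raised when a service reuses a revision for a different topology."""


def check_topology(prev, new):
    """Assert the invariant ns_server relies on: same revision implies
    identical topology contents. Returns the new topology if consistent."""
    if prev["rev"] == new["rev"] and prev["nodes"] != new["nodes"]:
        # This mirrors the badmatch in service_agent.erl line 654: the
        # revision was reused but the node list changed, so the rebalance
        # fails with no good way to recover.
        raise TopologyMismatch(
            f"revision {new['rev']!r} reused for differing topologies: "
            f"{prev['nodes']} vs {new['nodes']}")
    return new


# Before: both Backup service nodes are present (revision 42 is made up).
prev = {"rev": 42,
        "nodes": ["20aab8db8a8400f79260bbc031acb796",
                  "8dccfe5b6d428dd31a33c901f3b3bfc3"]}

# After one node is rebalanced out, Backup returned the shrunken node list
# but kept the same revision, which is what triggered the failure here.
new = {"rev": 42,
       "nodes": ["8dccfe5b6d428dd31a33c901f3b3bfc3"]}

try:
    check_topology(prev, new)
except TopologyMismatch as err:
    print("rebalance would fail:", err)
```

The correct behavior for the service would be to bump the revision whenever the topology changes, so that ns_server can distinguish a stale poll response from a genuinely inconsistent one.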
Attachments
Issue Links
- duplicates MB-50839: Backup: Service rebalance failed with reason "service_rebalance_failed" during service_agent,long_poll_worker_loop (Closed)