MB-51891 (Couchbase Server)

Backup Service reports updated topology using same revision to ns_server at end of rebalance

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • 7.0.3
    • tools
    • None
    • Untriaged
    • 1
    • Unknown

    Description

      In a recent case from the field we saw rebalance fail for the following reason:

      2022-04-19T16:34:51.847-07:00, [ns_orchestrator:0:critical:message(ns_1@172.23.104.203) - Rebalance exited with reason {service_rebalance_failed,backup,
                                    {agent_died,<31216.25955.386>,
                                     {linked_process_died,<31216.17073.386>,
                                      {'ns_1@172.23.120.109',
                                       {{badmatch,
                                         {false,
                                          {topology,[],
                                           [<<"8dccfe5b6d428dd31a33c901f3b3bfc3">>],
                                           false,[]},
                                          {topology,[],
                                           [<<"20aab8db8a8400f79260bbc031acb796">>,
                                            <<"8dccfe5b6d428dd31a33c901f3b3bfc3">>],
                                           true,[]}}},
                                        [{service_agent,long_poll_worker_loop,5,
                                          [{file,"src/service_agent.erl"},
                                           {line,654}]},
                                         {proc_lib,init_p,3,
                                          [{file,"proc_lib.erl"},{line,234}]}]}}}}}.
      

      This occurs at this point in the code: https://github.com/couchbase/ns_server/blob/cheshire-cat/src/service_agent.erl#L654. ns_server queries the service for the state of its topology, and the service is required to always return the same topology for a given revision: if the revision is the same, the topology must be the same, and ns_server asserts that this is true. The failure of this assertion is a problem because it causes the rebalance to fail and there isn't a good way to fix it.
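
      The invariant described above can be illustrated with a minimal Python sketch. This is not the ns_server code (which is Erlang); the names Topology, RevisionInvariantError, and check_poll_result, and the field names nodes and is_balanced, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Topology:
    # Hypothetical, simplified view of a service topology response.
    nodes: Tuple[str, ...]
    is_balanced: bool


class RevisionInvariantError(Exception):
    """Raised when a revision is reused for a different topology."""


def check_poll_result(
    prev_rev: Optional[int],
    prev_topology: Optional[Topology],
    new_rev: int,
    new_topology: Topology,
) -> Tuple[int, Topology]:
    # Same revision must imply same topology; anything else is a violation.
    if prev_rev is not None and new_rev == prev_rev and new_topology != prev_topology:
        raise RevisionInvariantError(
            f"revision {new_rev} reused: {prev_topology} -> {new_topology}"
        )
    return new_rev, new_topology
```

      In this sketch, a response that carries a new revision is always accepted, and a response that repeats the previous revision is accepted only if the topology is identical to the one seen before.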

      Note that the failure occurred at the end of a rebalance in which one of the Backup service nodes, 20aab8db8a8400f79260bbc031acb796, was being rebalanced out. Backup returned the updated topology containing only the node 8dccfe5b6d428dd31a33c901f3b3bfc3; however, it carried the same revision as its previous response to ns_server, which is apparently the cause of the problem.
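
      One way a service could avoid reusing a revision is to bump it atomically whenever the topology it reports changes. The following Python sketch shows the idea only; it is not the Backup Service implementation, and the class TopologyState and its methods are hypothetical. The node IDs are the ones from this report.

```python
import threading
from typing import Iterable, Tuple


class TopologyState:
    """Hypothetical holder that ties every topology change to a new revision."""

    def __init__(self, nodes: Iterable[str], is_balanced: bool) -> None:
        self._lock = threading.Lock()
        self._rev = 0
        self._nodes = tuple(sorted(nodes))
        self._is_balanced = is_balanced

    def update(self, nodes: Iterable[str], is_balanced: bool) -> int:
        new_nodes = tuple(sorted(nodes))
        with self._lock:
            # Bump the revision only when the reported topology really changes,
            # and do so under the same lock that guards the topology itself.
            if (new_nodes, is_balanced) != (self._nodes, self._is_balanced):
                self._nodes = new_nodes
                self._is_balanced = is_balanced
                self._rev += 1
            return self._rev

    def snapshot(self) -> Tuple[int, Tuple[str, ...], bool]:
        # Revision and topology are read together so they can never disagree.
        with self._lock:
            return self._rev, self._nodes, self._is_balanced


state = TopologyState(
    ["20aab8db8a8400f79260bbc031acb796", "8dccfe5b6d428dd31a33c901f3b3bfc3"],
    is_balanced=True,
)
rev_before, _, _ = state.snapshot()
# The node being rebalanced out is dropped; the revision must change with it.
rev_after = state.update(["8dccfe5b6d428dd31a33c901f3b3bfc3"], is_balanced=False)
```

      Because the revision and the topology are updated and read under one lock in this sketch, a caller can never observe the new node list paired with the old revision.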

            People

              maks.januska Maksimiljans Januska
              dfinlay Dave Finlay