Couchbase Server: MB-61244

[CBBS] Backup service requires unnecessary rebalance

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • 7.6.2
    • 7.6.0, 7.2.0
    • tools
    • Untriaged
    • 0
    • No
    • Tools 2024-Q1

    Description

      TLDR: the backup service doesn't appear to keep track of the nodes in the cluster that provide the backup service, so after a cluster reboot a rebalance is required.

      When a cluster is rebooted, the backup service on each node returns just that node in "nodes" and reports "isBalanced" as true. Because the set of "nodes" doesn't match the nodes ns_server has configured for the backup service, a rebalance is required. To reproduce:

      • cluster_run -n 2 --dont-rename
      • cluster_connect -n 2 -s 1024 -I 512 -M plasma -T n0:kv+index+n1ql+fts+eventing+cbas+backup,n1:kv+index+n1ql+fts+eventing+cbas+backup
      • Log into UI and see that rebalance completes
      • Ctrl-C in the window where cluster_run was run
      • cluster_run -n 2 --dont-rename

      At this point the /pools/default endpoint returns that "eventing" and "backup" require a rebalance.

        "balanced": false,
        "servicesNeedRebalance": [
          {
            "code": "service_not_balanced",
            "description": "Service needs rebalance.",
            "services": [
              "eventing",
              "backup"
            ]
          }
        ]
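      For illustration, here's a minimal Python sketch (a hypothetical helper, not Couchbase code) that extracts the services flagged in a /pools/default response shaped like the excerpt above:

```python
import json

def services_needing_rebalance(pools_default: dict) -> list[str]:
    """Collect service names from every servicesNeedRebalance entry."""
    services = []
    for entry in pools_default.get("servicesNeedRebalance", []):
        services.extend(entry.get("services", []))
    return services

# Sample shaped like the /pools/default excerpt above.
sample = json.loads("""
{
  "balanced": false,
  "servicesNeedRebalance": [
    {
      "code": "service_not_balanced",
      "description": "Service needs rebalance.",
      "services": ["eventing", "backup"]
    }
  ]
}
""")

print(services_needing_rebalance(sample))  # ['eventing', 'backup']
```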

      The reason ns_server believes "backup" needs a rebalance is that the GetTopology response from each of the two nodes includes just that node and indicates isBalanced is true. Here are the entries for the two nodes on my run (note that each node only knows about itself):

      [json_rpc:debug,2024-03-20T14:46:03.158-07:00,n_0@127.0.0.1:json_rpc_connection-backup-service_api<0.1252.0>:json_rpc_connection:handle_info:107]got response: [{<<"id">>,2},
                     {<<"result">>,
                      {[{<<"rev">>,<<"AAAAAAAAAAI=">>},
                        {<<"nodes">>,[<<"1080b788c0e8115ce25ff93ed60cd4f1">>]},
                        {<<"isBalanced">>,true}]}},
                     {<<"error">>,null}]
      

      and from the other node:

      [json_rpc:debug,2024-03-20T14:46:03.164-07:00,n_1@127.0.0.1:json_rpc_connection-backup-service_api<0.1349.0>:json_rpc_connection:handle_info:107]got response: [{<<"id">>,2},
                     {<<"result">>,
                      {[{<<"rev">>,<<"AAAAAAAAAAI=">>},
                        {<<"nodes">>,[<<"16745cea9a733708f49fa44e1def4528">>]},
                        {<<"isBalanced">>,true}]}},
                     {<<"error">>,null}]
      

      As an example of what would be expected, this is the response after doing a rebalance from the UI (both node UUIDs are now listed):

      [json_rpc:debug,2024-03-20T14:48:36.190-07:00,n_0@127.0.0.1:json_rpc_connection-backup-service_api<0.1252.0>:json_rpc_connection:handle_info:107]got response: [{<<"id">>,31},
                     {<<"result">>,
                      {[{<<"rev">>,<<"AAAAAAAAAAc=">>},
                        {<<"nodes">>,
                         [<<"1080b788c0e8115ce25ff93ed60cd4f1">>,
                          <<"16745cea9a733708f49fa44e1def4528">>]},
                        {<<"isBalanced">>,true}]}},
                     {<<"error">>,null}]
      

      So it appears the backup service doesn't persist which nodes in the cluster provide the backup service across a restart.
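      The behaviour above can be modelled with a short Python sketch (a simplified model of the check, not the actual ns_server Erlang code): a service is flagged for rebalance when any node's GetTopology result omits a configured node or reports isBalanced false. The node UUIDs are the ones from the logs above.

```python
def needs_rebalance(configured_nodes: set[str],
                    topologies: dict[str, dict]) -> bool:
    """Simplified model of ns_server's per-service balance check.

    configured_nodes: node UUIDs ns_server has assigned to the service.
    topologies: per-node GetTopology results, keyed by the reporting node.
    """
    for topo in topologies.values():
        if not topo["isBalanced"]:
            return True
        # Each node must know about every configured service node.
        if set(topo["nodes"]) != configured_nodes:
            return True
    return False

n0 = "1080b788c0e8115ce25ff93ed60cd4f1"
n1 = "16745cea9a733708f49fa44e1def4528"
configured = {n0, n1}

# After the reboot: each node reports only itself -> rebalance required.
after_reboot = {
    n0: {"nodes": [n0], "isBalanced": True},
    n1: {"nodes": [n1], "isBalanced": True},
}
print(needs_rebalance(configured, after_reboot))  # True

# After a UI rebalance: both nodes report both UUIDs -> balanced.
after_rebalance = {
    n0: {"nodes": [n0, n1], "isBalanced": True},
    n1: {"nodes": [n0, n1], "isBalanced": True},
}
print(needs_rebalance(configured, after_rebalance))  # False
```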

          People

            gilad.kalchheim Gilad Kalchheim
            steve.watanabe Steve Watanabe
            Votes: 0
            Watchers: 2


              Gerrit Reviews

                There are 2 open Gerrit changes
