Couchbase Server / MB-48077

/pools/default erroneously reports unbalanced cluster

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 7.1.0
    • Fix Version: 7.1.0
    • Component: query

    Description

      Hi,

      When creating (or upgrading to) an operator deployment using server 7.1.0-1169 (the most recently available docker image), the operator repeatedly tries to rebalance because /pools/default responds with "balanced": false (full response attached). However, the server UI reports that the rebalance was successful (image and rebalance report attached).

      This happens with a basic operator deployment: no buckets or data on the cluster, and no interaction from me. I have tested further and can confirm this does not happen when doing the same thing (on the same Operator deployment) with server 6.6.3, 7.0.0, 7.0.1, or 7.0.2, which leads me to suspect the issue is in the server.
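For readers reproducing the report, here is a minimal sketch of the check described above, assuming the operator simply reads the top-level "balanced" field of the /pools/default response. The function names and the trimmed response bodies are hypothetical; the operator's actual logic is not shown in this ticket.

```python
import json


def cluster_is_balanced(pools_default_body: str) -> bool:
    """Return True only if the top-level "balanced" field is the JSON value true."""
    doc = json.loads(pools_default_body)
    # A missing field is treated as "not balanced" (an assumption of this sketch).
    return doc.get("balanced") is True


def should_request_rebalance(pools_default_body: str) -> bool:
    """Hypothetical operator-side decision: rebalance whenever the flag is not true."""
    return not cluster_is_balanced(pools_default_body)


# Trimmed, hypothetical bodies; a real /pools/default response has many more fields.
reported_in_ticket = '{"balanced": false}'
expected_after_rebalance = '{"balanced": true}'

print(should_request_rebalance(reported_in_ticket))        # True
print(should_request_rebalance(expected_after_rebalance))  # False
```

With a server that always answers "balanced": false, a loop built on a check like this never settles, which matches the repeated rebalance attempts described here.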

          Activity

            hareen.kancharla Hareen Kancharla added a comment -

            Roo Thorp: Quick clarification question: do you see "balanced" set to false long after the rebalance has finished?

            "operator repeatedly tries to rebalance because /pools/default responds with "balanced": false"

            How frequently is the operator calling the /pools/default API?
            roo.thorp Roo Thorp added a comment -

            Hi Hareen Kancharla,

            Looking at the logs, when not rebalancing we query /pools/default every ~3 seconds, and during a rebalance it's every ~1 second.

            We always see "balanced":false - in the logs I cannot find an instance of this being true.


            hareen.kancharla Hareen Kancharla added a comment -

            Thanks Roo Thorp. Let me dig into it further.
            hareen.kancharla Hareen Kancharla added a comment - - edited

            From the logs and code I see:

            1) The /pools/default handler gets the status of all the services from ns_doctor. In the ns_doctor logs I see that the service_status of n1ql is always "needs_rebalance: true":

            [ns_doctor:debug,2021-08-20T14:27:57.190Z,ns_1@cb-example-0000.cb-example.default.svc:ns_doctor<0.863.0>:ns_doctor:handle_info:184]Current node statuses:
            [{'ns_1@cb-example-0000.cb-example.default.svc',
                  {{service_status,n1ql},[{connected,true},{needs_rebalance,true}]},
            [ns_doctor:debug,2021-08-20T14:28:57.227Z,ns_1@cb-example-0000.cb-example.default.svc:ns_doctor<0.863.0>:ns_doctor:handle_info:184]Current node statuses:
                  {{service_status,n1ql},[{connected,true},{needs_rebalance,true}]},
            [ns_doctor:debug,2021-08-20T14:29:57.230Z,ns_1@cb-example-0000.cb-example.default.svc:ns_doctor<0.863.0>:ns_doctor:handle_info:184]Current node statuses:
                  {{service_status,n1ql},[{connected,true},{needs_rebalance,true}]}, 
            [ns_doctor:debug,2021-08-20T14:31:57.249Z,ns_1@cb-example-0000.cb-example.default.svc:ns_doctor<0.863.0>:ns_doctor:handle_info:184]Current node statuses:
                  {{service_status,n1ql},[{connected,true},{needs_rebalance,true}]}, 
            

            2) The ns_doctor service_status is updated by the n1ql service_agent, which gets an incorrect JSON-RPC response for "GetCurrentTopology" from the N1QL service. We expect the "nodes" in the response below to be node UUIDs, but we receive node names instead.

            [json_rpc:debug,2021-08-20T14:32:44.309Z,ns_1@cb-example-0000.cb-example.default.svc:json_rpc_connection-n1ql-service_api<0.1086.0>:json_rpc_connection:handle_call:152]sending jsonrpc call:{[{jsonrpc,<<"2.0">>},
                                   {id,290},
                                   {method,<<"ServiceAPI.GetCurrentTopology">>},
                                   {params,[{[{rev,null},{timeout,30000}]}]}]}
            [json_rpc:debug,2021-08-20T14:32:44.312Z,ns_1@cb-example-0000.cb-example.default.svc:json_rpc_connection-n1ql-service_api<0.1086.0>:json_rpc_connection:handle_info:88]got response: [{<<"id">>,290},
                           {<<"result">>,
                            {[{<<"rev">>,<<"AAAAAAAAABg=">>},
                              {<<"nodes">>,
                               [<<"cb-example-0000.cb-example.default.svc:8091">>,  ##### HK: These should be node UUIDs and not node names.
                                <<"cb-example-0001.cb-example.default.svc:8091">>,
                                <<"cb-example-0002.cb-example.default.svc:8091">>]},
                              {<<"isBalanced">>,true}]}},
                           {<<"error">>,null}]
            [json_rpc:debug,2021-08-20T14:32:44.315Z,ns_1@cb-example-0000.cb-example.default.svc:json_rpc_connection-n1ql-service_api<0.1086.0>:json_rpc_connection:handle_call:152]sending jsonrpc call:{[{jsonrpc,<<"2.0">>},
                                   {id,294},
                                   {method,<<"ServiceAPI.GetCurrentTopology">>},
                                   {params,[{[{rev,null},{timeout,30000}]}]}]}
            [json_rpc:debug,2021-08-20T14:32:44.316Z,ns_1@cb-example-0000.cb-example.default.svc:json_rpc_connection-n1ql-service_api<0.1086.0>:json_rpc_connection:handle_info:88]got response: [{<<"id">>,294},
                           {<<"result">>,
                            {[{<<"rev">>,<<"AAAAAAAAABg=">>},
                              {<<"nodes">>,
                               [<<"cb-example-0000.cb-example.default.svc:8091">>,         ###### HK: we expect these to be UUID's. 
                                <<"cb-example-0001.cb-example.default.svc:8091">>,
                                <<"cb-example-0002.cb-example.default.svc:8091">>]},
                              {<<"isBalanced">>,true}]}},
                           {<<"error">>,null}]
            

            Moving the ticket to the N1QL team to take a look at it further.
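The analysis above does not show how ns_server turns the topology response into needs_rebalance, so the following is only a sketch of one plausible mechanism: it assumes the flag is raised whenever the node list reported by the service differs from the node UUIDs that the cluster manager expects, even if the service itself says isBalanced is true. The function name and the UUID strings are made up for illustration; the host names are taken from the log above.

```python
def needs_rebalance(topology: dict, expected_node_uuids: set) -> bool:
    """Hypothetical check, not ns_server's actual code.

    The service is considered to need a rebalance if it reports itself as
    unbalanced, or if the node IDs it reports do not match the expected UUIDs.
    """
    reported = set(topology.get("nodes", []))
    return (not topology.get("isBalanced", False)) or reported != expected_node_uuids


# Made-up UUIDs standing in for the real node UUIDs, which the ticket does not list.
expected = {"uuid-node-0", "uuid-node-1", "uuid-node-2"}

# Shape of the response seen in the log: host names instead of UUIDs.
buggy_topology = {
    "nodes": [
        "cb-example-0000.cb-example.default.svc:8091",
        "cb-example-0001.cb-example.default.svc:8091",
        "cb-example-0002.cb-example.default.svc:8091",
    ],
    "isBalanced": True,
}

# Shape of the response one would expect once UUIDs are reported.
fixed_topology = {"nodes": sorted(expected), "isBalanced": True}

print(needs_rebalance(buggy_topology, expected))  # True
print(needs_rebalance(fixed_topology, expected))  # False
```

Under that assumption, host names can never match the expected UUIDs, so the flag stays true no matter how many rebalances complete.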


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-1198 contains query commit 006703e with commit message:
            MB-48077 Report topology as UUIDs not host names

            Donald.haggart Donald Haggart added a comment -

            Should be able to verify by simply repeating the testing using an appropriate build (couchbase-server-7.1.0-1198 or later).
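When picking a build for verification, a small helper can confirm that a version string is at or past 7.1.0-1198. This is a sketch that assumes versions are written as major.minor.patch-build, as in the strings quoted in this ticket; the function names are hypothetical. It only answers whether a 7.1.0-line build is new enough to include the commit, and says nothing about older release lines such as 7.0.x, which the reporter found unaffected.

```python
import re


def parse_build(version: str) -> tuple:
    """Split a string such as '7.1.0-1198' into ((7, 1, 0), 1198)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch, build = (int(group) for group in match.groups())
    return (major, minor, patch), build


def contains_fix(version: str, first_fixed: str = "7.1.0-1198") -> bool:
    """Return True if `version` sorts at or after the first build with the fix."""
    return parse_build(version) >= parse_build(first_fixed)


print(contains_fix("7.1.0-1169"))  # False: the build from the original report
print(contains_fix("7.1.0-1198"))  # True: the first build containing the commit
```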
            roo.thorp Roo Thorp added a comment -

            Hi Donald Haggart,

            Thanks for this! I'll need to wait a bit for this build to be docker-ized, but when it's available I'll check and update the ticket. Thanks!


            pierre.regazzoni Pierre Regazzoni added a comment -

            Roo Thorp, is this resolved? If so, can you close it? Thanks.

            People

              Assignee: Roo Thorp
              Reporter: Roo Thorp