Couchbase Server / MB-48077

/pools/default erroneously reports unbalanced cluster


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Neo
    • Fix Version/s: Neo
    • Component/s: query
    • Labels:

      Description

      Hi,

      When creating (or upgrading to) an Operator deployment using server 7.1.0-1169 (the most recent Docker image available), the Operator repeatedly tries to rebalance because /pools/default responds with "balanced": false (full response attached). However, the server UI reports that the rebalance was successful (image and rebalance report attached).

      This happens with a basic Operator deployment: no buckets, data, etc. on the cluster, and no interaction from me. I have tested further and can confirm this does not happen when doing the same thing (on the same Operator deployment) with server 6.6.3, 7.0.0, 7.0.1, or 7.0.2, leading me to suspect it may be an issue with the server.
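
For anyone reproducing this outside the Operator, the check amounts to reading the "balanced" field from the /pools/default response. A minimal Python sketch, assuming the node's REST endpoint is on port 8091 and accepts basic auth; the function names are illustrative and are not taken from the Operator's code:

```python
import base64
import json
import urllib.request


def is_balanced(body: str) -> bool:
    """Return the "balanced" flag from a /pools/default JSON body.

    A missing field is treated as not balanced (an assumption of this sketch).
    """
    return bool(json.loads(body).get("balanced", False))


def fetch_pools_default(host: str, user: str, password: str) -> str:
    """GET /pools/default from one node and return the raw JSON body."""
    request = urllib.request.Request(f"http://{host}:8091/pools/default")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling `is_balanced(fetch_pools_default(host, user, password))` with real values for the placeholder arguments returns the flag for a live node; on an affected build it would be expected to stay False even after a successful rebalance.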


            Activity

            hareen.kancharla Hareen Kancharla added a comment -

            Roo Thorp: Quick clarification question: do you still see "balanced" set to false long after the rebalance has finished?

            "operator repeatedly tries to rebalance because /pools/default responds with "balanced": false"

            How frequently is the operator calling the /pools/default API?

            roo.thorp Roo Thorp added a comment -

            Hi Hareen Kancharla,

            Looking at the logs, it looks like we query /pools/default every ~3 seconds when not rebalancing, and every ~1 second during a rebalance.

            We always see "balanced":false - in the logs I cannot find an instance of this being true.

            hareen.kancharla Hareen Kancharla added a comment -

            Thanks Roo Thorp. Let me dig into it further.

            hareen.kancharla Hareen Kancharla added a comment - - edited

            From the logs and code I see:

            1) The /pools/default handler gets the status of all services from ns_doctor. In the ns_doctor logs, the service_status of n1ql always shows needs_rebalance as true:

            [ns_doctor:debug,2021-08-20T14:27:57.190Z,ns_1@cb-example-0000.cb-example.default.svc:ns_doctor<0.863.0>:ns_doctor:handle_info:184]Current node statuses:
            [{'ns_1@cb-example-0000.cb-example.default.svc',
                  {{service_status,n1ql},[{connected,true},{needs_rebalance,true}]},
            [ns_doctor:debug,2021-08-20T14:28:57.227Z,ns_1@cb-example-0000.cb-example.default.svc:ns_doctor<0.863.0>:ns_doctor:handle_info:184]Current node statuses:
                  {{service_status,n1ql},[{connected,true},{needs_rebalance,true}]},
            [ns_doctor:debug,2021-08-20T14:29:57.230Z,ns_1@cb-example-0000.cb-example.default.svc:ns_doctor<0.863.0>:ns_doctor:handle_info:184]Current node statuses:
                  {{service_status,n1ql},[{connected,true},{needs_rebalance,true}]}, 
            [ns_doctor:debug,2021-08-20T14:31:57.249Z,ns_1@cb-example-0000.cb-example.default.svc:ns_doctor<0.863.0>:ns_doctor:handle_info:184]Current node statuses:
                  {{service_status,n1ql},[{connected,true},{needs_rebalance,true}]}, 
            

            2) The ns_doctor service_status is updated by the n1ql service_agent, which gets an incorrect JSON-RPC response for "GetCurrentTopology" from the N1QL service. We expect the "nodes" in the response below to be node UUIDs, but we receive node names instead.

            [json_rpc:debug,2021-08-20T14:32:44.309Z,ns_1@cb-example-0000.cb-example.default.svc:json_rpc_connection-n1ql-service_api<0.1086.0>:json_rpc_connection:handle_call:152]sending jsonrpc call:{[{jsonrpc,<<"2.0">>},
                                   {id,290},
                                   {method,<<"ServiceAPI.GetCurrentTopology">>},
                                   {params,[{[{rev,null},{timeout,30000}]}]}]}
            [json_rpc:debug,2021-08-20T14:32:44.312Z,ns_1@cb-example-0000.cb-example.default.svc:json_rpc_connection-n1ql-service_api<0.1086.0>:json_rpc_connection:handle_info:88]got response: [{<<"id">>,290},
                           {<<"result">>,
                            {[{<<"rev">>,<<"AAAAAAAAABg=">>},
                              {<<"nodes">>,
                               [<<"cb-example-0000.cb-example.default.svc:8091">>,  ##### HK: These should be node UUIDs and not node names.
                                <<"cb-example-0001.cb-example.default.svc:8091">>,
                                <<"cb-example-0002.cb-example.default.svc:8091">>]},
                              {<<"isBalanced">>,true}]}},
                           {<<"error">>,null}]
            [json_rpc:debug,2021-08-20T14:32:44.315Z,ns_1@cb-example-0000.cb-example.default.svc:json_rpc_connection-n1ql-service_api<0.1086.0>:json_rpc_connection:handle_call:152]sending jsonrpc call:{[{jsonrpc,<<"2.0">>},
                                   {id,294},
                                   {method,<<"ServiceAPI.GetCurrentTopology">>},
                                   {params,[{[{rev,null},{timeout,30000}]}]}]}
            [json_rpc:debug,2021-08-20T14:32:44.316Z,ns_1@cb-example-0000.cb-example.default.svc:json_rpc_connection-n1ql-service_api<0.1086.0>:json_rpc_connection:handle_info:88]got response: [{<<"id">>,294},
                           {<<"result">>,
                            {[{<<"rev">>,<<"AAAAAAAAABg=">>},
                              {<<"nodes">>,
                               [<<"cb-example-0000.cb-example.default.svc:8091">>,         ###### HK: we expect these to be UUID's. 
                                <<"cb-example-0001.cb-example.default.svc:8091">>,
                                <<"cb-example-0002.cb-example.default.svc:8091">>]},
                              {<<"isBalanced">>,true}]}},
                           {<<"error">>,null}]
            

            Moving the ticket to the N1QL team to take a look at it further.
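
The mismatch described in this comment can be spotted mechanically. A minimal Python sketch, assuming the "result" object of the GetCurrentTopology response has already been decoded into a dict; the helper name and the host:port heuristic are illustrative, and the sketch deliberately does not assume any particular UUID format:

```python
import re

# Matches "host:port" entries such as the ones in the logged response.
HOST_PORT = re.compile(r"^[A-Za-z0-9.\-]+:\d+$")


def host_like_nodes(topology: dict) -> list:
    """Return the entries of topology["nodes"] that look like host:port names."""
    return [node for node in topology.get("nodes", []) if HOST_PORT.match(node)]


# Topology result copied from the log excerpt on this ticket.
logged = {
    "rev": "AAAAAAAAABg=",
    "nodes": [
        "cb-example-0000.cb-example.default.svc:8091",
        "cb-example-0001.cb-example.default.svc:8091",
        "cb-example-0002.cb-example.default.svc:8091",
    ],
    "isBalanced": True,
}

suspect = host_like_nodes(logged)
print(f"{len(suspect)} of {len(logged['nodes'])} entries look like host names")
```

For the logged response, all three entries are flagged. The heuristic is intentionally simple and would not recognize bare IPv6 addresses.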

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-1198 contains query commit 006703e with commit message:
            MB-48077 Report topology as UUIDs not host names

            Donald.haggart Donald Haggart added a comment -

            This should be verifiable by simply repeating the testing with an appropriate build (couchbase-server-7.1.0-1198 or later).
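
When picking an image to retest with, a quick way to tell whether a given 7.1.0 build is at or past the first fixed build is to compare build numbers. A small Python sketch, assuming version strings of the form "7.1.0-1198" with the "couchbase-server-" prefix already stripped; the function names are illustrative:

```python
def parse_build(version: str) -> tuple:
    """Split "7.1.0-1198" into a comparable tuple (7, 1, 0, 1198)."""
    release, _, build = version.partition("-")
    return tuple(int(part) for part in release.split(".")) + (int(build),)


# First build reported on this ticket as containing the fix.
FIRST_FIXED_BUILD = parse_build("7.1.0-1198")


def has_fix(version: str) -> bool:
    """True if a 7.1.0 build is at or past the first fixed build."""
    return parse_build(version) >= FIRST_FIXED_BUILD
```

The comparison is only meaningful for 7.1.0 builds. Earlier release lines such as 7.0.2 compare as older here, but they were reported on this ticket as unaffected.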

            roo.thorp Roo Thorp added a comment -

            Hi Donald Haggart,

            Thanks for this! I'll need to wait a bit for this build to be docker-ized, but when it's available I'll check and update the ticket. Thanks!


              People

              Assignee:
              roo.thorp Roo Thorp
              Reporter:
              roo.thorp Roo Thorp