When creating (or upgrading to) an Operator deployment using server 7.1.0-1169 (the most recently available Docker image), the operator repeatedly tries to rebalance because /pools/default responds with "balanced": false (full response attached). However, the server UI reports that the rebalance was successful (image & rebalance report attached).
This happens with a basic Operator deployment (no buckets, data, etc. on the cluster) and with no interaction from me. I have tested further and can confirm that this does not happen when doing the same thing (on the same Operator deployment) with server 6.6.3, 7.0.0, 7.0.1, or 7.0.2, which leads me to suspect it is an issue with the server.
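For context on what the operator is doing here: the check reduces to polling /pools/default and reading the "balanced" field. Below is a minimal, illustrative Go sketch of that poll loop, assuming a hypothetical endpoint and admin credentials; it is not the actual Operator code.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// poolsDefault captures only the field the operator cares about from /pools/default.
type poolsDefault struct {
    Balanced bool `json:"balanced"`
}

// isBalanced fetches /pools/default and returns the "balanced" flag.
// baseURL, user and pass are placeholders for this sketch.
func isBalanced(client *http.Client, baseURL, user, pass string) (bool, error) {
    req, err := http.NewRequest(http.MethodGet, baseURL+"/pools/default", nil)
    if err != nil {
        return false, err
    }
    req.SetBasicAuth(user, pass)

    resp, err := client.Do(req)
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()

    var pd poolsDefault
    if err := json.NewDecoder(resp.Body).Decode(&pd); err != nil {
        return false, err
    }
    return pd.Balanced, nil
}

func main() {
    client := &http.Client{Timeout: 10 * time.Second}
    // Poll every few seconds; on 7.1.0-1169 this keeps returning false,
    // so the operator keeps scheduling rebalances.
    for range time.Tick(3 * time.Second) {
        balanced, err := isBalanced(client, "http://cb-example-0000.cb-example.default.svc:8091", "Administrator", "password")
        if err != nil {
            fmt.Println("poll failed:", err)
            continue
        }
        fmt.Println("balanced:", balanced)
    }
}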
Hareen Kancharla
added a comment - Roo Thorp: Quick clarification question: do you still see "balanced" set to false long after the rebalance has finished?
"operator repeatedly tries to rebalance because /pools/default responds with "balanced": false"
How frequently is the operator calling the pools/default API?
Roo Thorp
added a comment - Hi Hareen Kancharla,
Looking at the logs, when not rebalancing we query pools/default roughly every 3 seconds, and during a rebalance roughly every 1 second.
We always see "balanced": false - I cannot find a single instance in the logs of it being true.
Hareen Kancharla
added a comment (edited) - From the logs and code I see:
1) The pools/default handler gets the status of all services from ns_doctor. In the ns_doctor logs I see that the service_status for n1ql is always "needs_rebalance: true" (see the sketch after this comment):
[ns_doctor:debug,2021-08-20T14:27:57.190Z,ns_1@cb-example-0000.cb-example.default.svc:ns_doctor<0.863.0>:ns_doctor:handle_info:184]Current node statuses:
[{'ns_1@cb-example-0000.cb-example.default.svc',
{{service_status,n1ql},[{connected,true},{needs_rebalance,true}]},
[ns_doctor:debug,2021-08-20T14:28:57.227Z,ns_1@cb-example-0000.cb-example.default.svc:ns_doctor<0.863.0>:ns_doctor:handle_info:184]Current node statuses:
{{service_status,n1ql},[{connected,true},{needs_rebalance,true}]},
[ns_doctor:debug,2021-08-20T14:29:57.230Z,ns_1@cb-example-0000.cb-example.default.svc:ns_doctor<0.863.0>:ns_doctor:handle_info:184]Current node statuses:
{{service_status,n1ql},[{connected,true},{needs_rebalance,true}]},
[ns_doctor:debug,2021-08-20T14:31:57.249Z,ns_1@cb-example-0000.cb-example.default.svc:ns_doctor<0.863.0>:ns_doctor:handle_info:184]Current node statuses:
{{service_status,n1ql},[{connected,true},{needs_rebalance,true}]},
2) The ns_doctor service_status is updated by the n1ql service_agent, which gets an incorrect JSON-RPC response for "GetCurrentTopology" from the N1QL service. We expect the "nodes" in the response below to be node UUIDs, but we receive the node names instead.
[json_rpc:debug,2021-08-20T14:32:44.309Z,ns_1@cb-example-0000.cb-example.default.svc:json_rpc_connection-n1ql-service_api<0.1086.0>:json_rpc_connection:handle_call:152]sending jsonrpc call:{[{jsonrpc,<<"2.0">>},
{id,290},
{method,<<"ServiceAPI.GetCurrentTopology">>},
{params,[{[{rev,null},{timeout,30000}]}]}]}
[json_rpc:debug,2021-08-20T14:32:44.312Z,ns_1@cb-example-0000.cb-example.default.svc:json_rpc_connection-n1ql-service_api<0.1086.0>:json_rpc_connection:handle_info:88]got response: [{<<"id">>,290},
{<<"result">>,
{[{<<"rev">>,<<"AAAAAAAAABg=">>},
{<<"nodes">>,
[<<"cb-example-0000.cb-example.default.svc:8091">>, ##### HK: These should be node UUIDs and not node names.
<<"cb-example-0001.cb-example.default.svc:8091">>,
<<"cb-example-0002.cb-example.default.svc:8091">>]},
{<<"isBalanced">>,true}]}},
{<<"error">>,null}]
[json_rpc:debug,2021-08-20T14:32:44.315Z,ns_1@cb-example-0000.cb-example.default.svc:json_rpc_connection-n1ql-service_api<0.1086.0>:json_rpc_connection:handle_call:152]sending jsonrpc call:{[{jsonrpc,<<"2.0">>},
{id,294},
{method,<<"ServiceAPI.GetCurrentTopology">>},
{params,[{[{rev,null},{timeout,30000}]}]}]}
[json_rpc:debug,2021-08-20T14:32:44.316Z,ns_1@cb-example-0000.cb-example.default.svc:json_rpc_connection-n1ql-service_api<0.1086.0>:json_rpc_connection:handle_info:88]got response: [{<<"id">>,294},
{<<"result">>,
{[{<<"rev">>,<<"AAAAAAAAABg=">>},
{<<"nodes">>,
[<<"cb-example-0000.cb-example.default.svc:8091">>, ###### HK: we expect these to be UUID's.
<<"cb-example-0001.cb-example.default.svc:8091">>,
<<"cb-example-0002.cb-example.default.svc:8091">>]},
{<<"isBalanced">>,true}]}},
{<<"error">>,null}]
Moving the ticket to the N1QL team to take a look at it further.
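Tying points 1) and 2) together: /pools/default can only report "balanced": true once no service status carries needs_rebalance, and because the n1ql service_agent receives host names where ns_server expects node UUIDs, the n1ql service status never clears needs_rebalance. The Go sketch below only illustrates that effective rule with made-up types; the real logic lives in ns_server (Erlang).

package main

import "fmt"

// serviceStatus mirrors the shape of the per-service entries reported by ns_doctor,
// e.g. {{service_status,n1ql},[{connected,true},{needs_rebalance,true}]}.
// Illustrative type only, not the actual ns_server data structure.
type serviceStatus struct {
    Service        string
    Connected      bool
    NeedsRebalance bool
}

// clusterBalanced sketches the effective rule: the cluster is reported as
// balanced only when no service still needs a rebalance.
func clusterBalanced(statuses []serviceStatus) bool {
    for _, s := range statuses {
        if s.NeedsRebalance {
            return false
        }
    }
    return true
}

func main() {
    statuses := []serviceStatus{
        {Service: "kv", Connected: true, NeedsRebalance: false},
        {Service: "n1ql", Connected: true, NeedsRebalance: true}, // stuck at true in this ticket
    }
    fmt.Println("balanced:", clusterBalanced(statuses)) // prints: balanced: false
}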
Couchbase Build Team
added a comment - Build couchbase-server-7.1.0-1198 contains query commit 006703e with commit message:
MB-48077 Report topology as UUIDs not host names
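In other words, the fix is for the query service to populate the "nodes" field of the GetCurrentTopology result with the node UUIDs ns_server expects, rather than "host:port" names. Below is a minimal Go sketch of the corrected response shape; the types and UUID values are made up for illustration and are not the actual query service or cbauth service API code.

package main

import (
    "encoding/json"
    "fmt"
)

// topology is an illustrative stand-in for the structure marshalled into the
// "result" of ServiceAPI.GetCurrentTopology (field names chosen to match the
// JSON seen in the logs above).
type topology struct {
    Rev        string   `json:"rev"`
    Nodes      []string `json:"nodes"`
    IsBalanced bool     `json:"isBalanced"`
}

// node pairs the host name with the opaque UUID ns_server knows the node by.
type node struct {
    HostName string // e.g. "cb-example-0000.cb-example.default.svc:8091"
    UUID     string // placeholder value; ns_server assigns the real one
}

// currentTopology shows the corrected behaviour: report node UUIDs, not the
// "host:port" names that caused needs_rebalance to stay true.
func currentTopology(rev string, nodes []node, balanced bool) topology {
    ids := make([]string, 0, len(nodes))
    for _, n := range nodes {
        ids = append(ids, n.UUID) // previously the host name was reported here
    }
    return topology{Rev: rev, Nodes: ids, IsBalanced: balanced}
}

func main() {
    t := currentTopology("AAAAAAAAABg=", []node{
        {HostName: "cb-example-0000.cb-example.default.svc:8091", UUID: "node-uuid-0000"},
        {HostName: "cb-example-0001.cb-example.default.svc:8091", UUID: "node-uuid-0001"},
    }, true)
    out, _ := json.MarshalIndent(t, "", "  ")
    fmt.Println(string(out))
}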
Donald Haggart
added a comment - This should be verifiable by simply repeating the test with an appropriate build (couchbase-server-7.1.0-1198 or later).
Roo Thorp
added a comment - Hi Donald Haggart,
Thanks for this! I'll need to wait a bit for this build to be Docker-ized, but once it's available I'll check and update the ticket.