Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-34911

[high-bucket] FTS: Rebalance failed after failover in high bucket density test

    XMLWordPrintable

Details

    • Untriaged
    • Unknown

    Attachments

      Issue Links

        For Gerrit Dashboard: MB-34911
        # Subject Branch Project Status CR V

        Activity

          mahesh.mandhare Mahesh Mandhare (Inactive) created issue -
          raju Raju Suravarjjala made changes -
          Field Original Value New Value
          Fix Version/s Mad-Hatter [ 15037 ]
          raju Raju Suravarjjala made changes -
          Priority Major [ 3 ] Critical [ 2 ]
          steve.watanabe Steve Watanabe made changes -
          Assignee Ajit Yagaty [ ajit.yagaty ] Steve Watanabe [ steve.watanabe ]

          Mahesh Mandhare I don't see a "bugdb" directory in the couchbase s3.amazonaws.com area. I'm using this access key: AKIAIYNLVHFWDSSGDORQ
          Please provide the amazonaws access key to get to the files....or if they're located elsewhere please point to them.

          steve.watanabe Steve Watanabe added a comment - Mahesh Mandhare I don't see a "bugdb" directory in the couchbase s3.amazonaws.com area. I'm using this access key: AKIAIYNLVHFWDSSGDORQ Please provide the amazonaws access key to get to the files....or if they're located elsewhere please point to them.
          steve.watanabe Steve Watanabe made changes -
          Assignee Steve Watanabe [ steve.watanabe ] Mahesh Mandhare [ mahesh.mandhare ]
          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment - Build 6.5.0-4059 Observed rebalance failure after failover in high bucket density test with reason - Rebalance exited with reason {service_rebalance_failed,fts, {linked_process_died,<25470.31340.268>, {timeout, {gen_server,call, [<25470.4583.0>, {call,"ServiceAPI.GetCurrentTopology", #Fun<json_rpc_connection.0.102434519>}, 60000]}}}}. Rebalance Operation Id = c260a39df60307f520b3b5aa4eb641d3 Logs- Data nodes:   https://cb-jira.s3.us-east-2.amazonaws.com/logs/hbd_after_failover_reb_fail/collectinfo-2019-09-05T093920-ns_1%40172.23.97.12.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/hbd_after_failover_reb_fail/collectinfo-2019-09-05T093920-ns_1%40172.23.97.13.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/hbd_after_failover_reb_fail/collectinfo-2019-09-05T093920-ns_1%40172.23.97.14.zip Other nodes: https://cb-jira.s3.us-east-2.amazonaws.com/logs/hbd_after_failover_reb_fail/collectinfo-2019-09-05T093920-ns_1%40172.23.96.20.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/hbd_after_failover_reb_fail/collectinfo-2019-09-05T093920-ns_1%40172.23.96.23.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/hbd_after_failover_reb_fail/collectinfo-2019-09-05T093920-ns_1%40172.23.97.177.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/hbd_after_failover_reb_fail/collectinfo-2019-09-05T093920-ns_1%40172.23.97.19.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/hbd_after_failover_reb_fail/collectinfo-2019-09-05T093920-ns_1%40172.23.97.20.zip
          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Steve Watanabe [ steve.watanabe ]

          Looking at this crash from .97.12

          [error_logger:error,2019-09-05T01:29:03.818-07:00,ns_1@172.23.97.12:error_logger<0.32.0>:ale_error_logger_handler:do_log:203]
          =========================CRASH REPORT=========================
            crasher:
              initial call: service_rebalancer:'-run_rebalance/7-fun-1-'/0
              pid: <0.3890.876>
              registered_name: 'service_rebalancer-fts-worker'
              exception exit: {{linked_process_died,<25470.31340.268>,
                                {timeout,
                                 {gen_server,call,
                                  [<25470.4583.0>,
                                   {call,"ServiceAPI.GetCurrentTopology",
                                    #Fun<json_rpc_connection.0.102434519>},
                                   60000]}}},
                               {gen_server,call,
                                [{'service_agent-fts','ns_1@172.23.96.20'},
                                 {if_rebalance,<0.10998.876>,
                                  {start_rebalance,
                                   <<"630e458abd4db2a0c89d8510c2cec8b7">>,rebalance,
                                   [{[{node_id,<<"35086b9bc71f1dd28b65da1ccff3bcf7">>},
                                      {priority,0},
                                      {opaque,null}],
                                     full}],
                                   [],<0.3890.876>}},
                                 90000]}}
                in function  gen_server:call/3 (gen_server.erl, line 214)
                in call from service_rebalancer:rebalance/8 (src/service_rebalancer.erl, line 126)
              ancestors: ['service_rebalancer-fts',<0.20654.829>]
              message_queue_len: 0
              messages: []
              links: [<0.10998.876>]
              dictionary: []
              trap_exit: false
              status: running
              heap_size: 2586
              stack_size: 27
              reductions: 7370
            neighbours:
          

          We're doing a service rebalance of fts and it timed out while processing a GetCurrentTopology request

          Looking on node .96.20 which is where the GetCurrentTopology was being done (ns_server.json_rpc.log).
          Note the RPC was sent 1 minute prior to the crash above on .97.12 which is the expected time ?RPC_TIMEOUT in
          service_api.erl

          [json_rpc:debug,2019-09-05T01:28:03.815-07:00,ns_1@172.23.96.20:json_rpc_connection-fts-service_api<0.4583.0>:json_rpc_connection:handle_call:158]sending jsonrpc call:{[{jsonrpc,<<"2.0">>},
                                 {id,9305},
                                 {method,<<"ServiceAPI.GetCurrentTopology">>},
                                 {params,[{[{rev,<<"NTQ=">>},{timeout,30000}]}]}]}
          

          But there's no response in the log. Normally we'd see something similar to
          the following which is the response to id:9303

          [json_rpc:debug,2019-09-05T01:28:03.815-07:00,ns_1@172.23.96.20:json_rpc_connection-fts-service_api<0.4583.0>:json_rpc_connection:handle_info:94]got response: [{<<"id">>,9303},
                         {<<"result">>,
                          {[{<<"rev">>,<<"NTQ=">>},
                            {<<"nodes">>,[<<"35086b9bc71f1dd28b65da1ccff3bcf7">>]},
                            {<<"isBalanced">>,true}]}},
                         {<<"error">>,null}]
          

          Looking at the ns_server.fts.log on .96.20 I don't see anything obvious related to this...but then again, I don't know that code

          Discussing with Aliaksey he wondered if it might have something to do with GetCurrentTopology cancellation.

          In godeps/src/github.com/couchbase/cbauth/service there's this code which if the initial GetCurrentTopology
          is cancelled then the GetCurrentTopology is called again with (nil, nil) which effectively means to return the
          current topology info. If the fts manager servicing this didn't process the cancel this behavior might occur

          func (s serviceAPI) GetCurrentTopology(req GetCurrentTopologyReq, res *Topology) error {
              var topology *Topology
              var err error
           
              withTimeout(req.Timeout, func(cancel Cancel) {
                  topology, err = s.mgr.GetCurrentTopology(req.Rev, cancel)
                  if err == ErrCanceled {
                      topology, err = s.mgr.GetCurrentTopology(nil, nil)
                  }
              })
           
              if err == nil {
                  *res = *topology
              }
           
              return err
          }
          

          In the fts code (goproj/src/github.com/couchbase/cbgt/ctl/manager.go) there is this code

          func (m *CtlMgr) GetCurrentTopology(haveTopologyRev service.Revision,
              cancelCh service.Cancel) (*service.Topology, error) {
              ctlTopology, err :=
                  m.ctl.WaitGetTopology(string(haveTopologyRev), cancelCh)
              if err != nil {
                  if err != service.ErrCanceled {
                      log.Errorf("ctl/manager: GetCurrentTopology,"+
                          " haveTopologyRev: %s, err: %v", haveTopologyRev, err)
                  }
           
                  return nil, err
              }
          

          and in the ns_server.fts.log (.96.20) there's this log...but it's ten minutes later...so don't know if it's related,

          2019-09-05T01:38:21.643-07:00 [INFO] ctl/manager: GetCurrentTopology, haveTopologyRev: , changed, rv: &{Rev:[53 54] Nodes:[35086b9bc71f1dd28b65da1ccff3bcf7] IsBalanced:false Messages:[error: operation_canceled]}
          

          Assigning to the FTS folks for further triage.

          steve.watanabe Steve Watanabe added a comment - Looking at this crash from .97.12 [error_logger:error,2019-09-05T01:29:03.818-07:00,ns_1@172.23.97.12:error_logger<0.32.0>:ale_error_logger_handler:do_log:203] =========================CRASH REPORT========================= crasher: initial call: service_rebalancer:'-run_rebalance/7-fun-1-'/0 pid: <0.3890.876> registered_name: 'service_rebalancer-fts-worker' exception exit: {{linked_process_died,<25470.31340.268>, {timeout, {gen_server,call, [<25470.4583.0>, {call,"ServiceAPI.GetCurrentTopology", #Fun<json_rpc_connection.0.102434519>}, 60000]}}}, {gen_server,call, [{'service_agent-fts','ns_1@172.23.96.20'}, {if_rebalance,<0.10998.876>, {start_rebalance, <<"630e458abd4db2a0c89d8510c2cec8b7">>,rebalance, [{[{node_id,<<"35086b9bc71f1dd28b65da1ccff3bcf7">>}, {priority,0}, {opaque,null}], full}], [],<0.3890.876>}}, 90000]}} in function gen_server:call/3 (gen_server.erl, line 214) in call from service_rebalancer:rebalance/8 (src/service_rebalancer.erl, line 126) ancestors: ['service_rebalancer-fts',<0.20654.829>] message_queue_len: 0 messages: [] links: [<0.10998.876>] dictionary: [] trap_exit: false status: running heap_size: 2586 stack_size: 27 reductions: 7370 neighbours: We're doing a service rebalance of fts and it timed out while processing a GetCurrentTopology request Looking on node .96.20 which is where the GetCurrentTopology was being done (ns_server.json_rpc.log). Note the RPC was sent 1 minute prior to the crash above on .97.12 which is the expected time ?RPC_TIMEOUT in service_api.erl [json_rpc:debug,2019-09-05T01:28:03.815-07:00,ns_1@172.23.96.20:json_rpc_connection-fts-service_api<0.4583.0>:json_rpc_connection:handle_call:158]sending jsonrpc call:{[{jsonrpc,<<"2.0">>}, {id,9305}, {method,<<"ServiceAPI.GetCurrentTopology">>}, {params,[{[{rev,<<"NTQ=">>},{timeout,30000}]}]}]} But there's no response in the log. Normally we'd see something similar to the following which is the response to id:9303 [json_rpc:debug,2019-09-05T01:28:03.815-07:00,ns_1@172.23.96.20:json_rpc_connection-fts-service_api<0.4583.0>:json_rpc_connection:handle_info:94]got response: [{<<"id">>,9303}, {<<"result">>, {[{<<"rev">>,<<"NTQ=">>}, {<<"nodes">>,[<<"35086b9bc71f1dd28b65da1ccff3bcf7">>]}, {<<"isBalanced">>,true}]}}, {<<"error">>,null}] Looking at the ns_server.fts.log on .96.20 I don't see anything obvious related to this...but then again, I don't know that code Discussing with Aliaksey he wondered if it might have something to do with GetCurrentTopology cancellation. In godeps/src/github.com/couchbase/cbauth/service there's this code which if the initial GetCurrentTopology is cancelled then the GetCurrentTopology is called again with (nil, nil) which effectively means to return the current topology info. If the fts manager servicing this didn't process the cancel this behavior might occur func (s serviceAPI) GetCurrentTopology(req GetCurrentTopologyReq, res *Topology) error { var topology *Topology var err error   withTimeout(req.Timeout, func(cancel Cancel) { topology, err = s.mgr.GetCurrentTopology(req.Rev, cancel) if err == ErrCanceled { topology, err = s.mgr.GetCurrentTopology(nil, nil) } })   if err == nil { *res = *topology }   return err } In the fts code (goproj/src/github.com/couchbase/cbgt/ctl/manager.go) there is this code func (m *CtlMgr) GetCurrentTopology(haveTopologyRev service.Revision, cancelCh service.Cancel) (*service.Topology, error) { ctlTopology, err := m.ctl.WaitGetTopology(string(haveTopologyRev), cancelCh) if err != nil { if err != service.ErrCanceled { log.Errorf("ctl/manager: GetCurrentTopology,"+ " haveTopologyRev: %s, err: %v", haveTopologyRev, err) }   return nil, err } and in the ns_server.fts.log (.96.20) there's this log...but it's ten minutes later...so don't know if it's related, 2019-09-05T01:38:21.643-07:00 [INFO] ctl/manager: GetCurrentTopology, haveTopologyRev: , changed, rv: &{Rev:[53 54] Nodes:[35086b9bc71f1dd28b65da1ccff3bcf7] IsBalanced:false Messages:[error: operation_canceled]} Assigning to the FTS folks for further triage.
          steve.watanabe Steve Watanabe made changes -
          Assignee Steve Watanabe [ steve.watanabe ] Ajit Yagaty [ ajit.yagaty ]
          steve.watanabe Steve Watanabe made changes -
          Component/s fts [ 12726 ]
          Component/s ns_server [ 10019 ]
          steve.watanabe Steve Watanabe made changes -
          Assignee Ajit Yagaty [ ajit.yagaty ] Keshav Murthy [ keshav ]
          keshav Keshav Murthy made changes -
          Assignee Keshav Murthy [ keshav ] Steve Yen [ steve ]

          Additional tidbits... Normally, StartTopologyChange requests are handled
          fairly quickly in FTS

           

          2019-09-04T15:16:28.633-07:00 [INFO] ctl/manager: StartTopologyChange, change: {b02b3c57e0718f7efc3681d0326ec000 [] topology-change-rebalance [{{35086b9bc71f1dd28b65da1ccff3bcf7 0 <nil>} recovery-full}] []}
          2019-09-04T15:16:30.115-07:00 [INFO] ctl/manager: StartTopologyChange, started
          

          In this particular case it takes ten minutes.

           

          2019-09-05T01:28:15.511-07:00 [INFO] ctl/manager: StartTopologyChange, change: {630e458abd4db2a0c89d8510c2cec8b7 [] topology-change-rebalance [{{35086b9bc71f1dd28b65da1ccff3bcf7 0 <nil>} recovery-full}] []}
          2019-09-05T01:38:15.214-07:00 [INFO] ctl/manager: StartTopologyChange, started
          

          During this time, service_agent is being restarted and fails continuously.  Another indication that fts handling of these requests appears wedged.

          [ns_server:error,2019-09-05T01:30:03.818-07:00,ns_1@172.23.96.20:<0.23106.295>:service_agent:wait_for_connection_loop:299]No connection with label "fts-service_api" after 60000ms. Exiting.
          [ns_server:error,2019-09-05T01:30:03.818-07:00,ns_1@172.23.96.20:service_agent-fts<0.23029.295>:service_agent:handle_info:231]Linked process <0.23106.295> died with reason {no_connection,
           "fts-service_api"}. Terminating

          steve.watanabe Steve Watanabe added a comment - Additional tidbits... Normally, StartTopologyChange requests are handled fairly quickly in FTS   2019-09-04T15:16:28.633-07:00 [INFO] ctl/manager: StartTopologyChange, change: {b02b3c57e0718f7efc3681d0326ec000 [] topology-change-rebalance [{{35086b9bc71f1dd28b65da1ccff3bcf7 0 <nil>} recovery-full}] []} 2019-09-04T15:16:30.115-07:00 [INFO] ctl/manager: StartTopologyChange, started In this particular case it takes ten minutes.   2019-09-05T01:28:15.511-07:00 [INFO] ctl/manager: StartTopologyChange, change: {630e458abd4db2a0c89d8510c2cec8b7 [] topology-change-rebalance [{{35086b9bc71f1dd28b65da1ccff3bcf7 0 <nil>} recovery-full}] []} 2019-09-05T01:38:15.214-07:00 [INFO] ctl/manager: StartTopologyChange, started During this time, service_agent is being restarted and fails continuously.  Another indication that fts handling of these requests appears wedged. [ns_server:error,2019-09-05T01:30:03.818-07:00,ns_1@172.23.96.20:<0.23106.295>:service_agent:wait_for_connection_loop:299]No connection with label "fts-service_api" after 60000ms. Exiting. [ns_server:error,2019-09-05T01:30:03.818-07:00,ns_1@172.23.96.20:service_agent-fts<0.23029.295>:service_agent:handle_info:231]Linked process <0.23106.295> died with reason {no_connection, "fts-service_api"}. Terminating

          Hey Steve Yen, I haven't tested this fix, but could you take a look at this: http://review.couchbase.org/#/c/114728/

          Essentially it's the same way we dealt with GetTaskList - adding a timeout.

          abhinav Abhinav Dangeti added a comment - Hey Steve Yen , I haven't tested this fix, but could you take a look at this:  http://review.couchbase.org/#/c/114728/ Essentially it's the same way we dealt with GetTaskList - adding a timeout.
          steve Steve Yen made changes -
          Summary Rebalance failed after failover in high bucket density test FTS: Rebalance failed after failover in high bucket density test
          steve Steve Yen added a comment -

          Hi Abhinav – thanks for the patch. Going to assign this ticket to you with my appreciation – for having this more loaded more freshly into your brain in contrast to myself.

          steve Steve Yen added a comment - Hi Abhinav – thanks for the patch. Going to assign this ticket to you with my appreciation – for having this more loaded more freshly into your brain in contrast to myself.
          steve Steve Yen made changes -
          Assignee Steve Yen [ steve ] Abhinav Dangeti [ abhinav ]
          Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited

          Just trying to bring the issue summary to the ticket,

          GetCurrentTopology hangs due to the blocked WaitGetTopology. And that blocks due to the lock deprivation here - https://github.com/couchbase/cbgt/blob/master/ctl/ctl.go#L492

          And this lock is blocked/held with the getMovingPartitionsCount() call which got triggered from StartTopologyChange().

          getMovingPartitionsCount() is failing at cbgt.CouchbasePartitions at https://github.com/couchbase/cbgt/blob/master/ctl/ctl.go#L209 while holding the lock.

          And this is evident from the fts traces : 

          2019-09-05T01:28:36.800-07:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-9, sourceParams: "{}", err: Get http://127.0.0.1:8091/pools/default/buckets?v=1531394&uuid=c645431935b42d6c9b024e8f10ca1c81: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

          It happens 24 times between min 28 and 38, ie in that 10 minutes window mentioned in early comments.

          If we look into the FTS logs, there are many endpoint timeout/cbauth db errors like below.

          eg:

          2019-09-05T01:28:36.801-07:00 [INFO] feed_dcp: OnError, name: bucket-10-fts_more_words_30d403f3d2c087ec_aa574717: bucketName: bucket-10, bucketUUID: , err: Get http://127.0.0.1:8091/pools/default/buckets?v=1531394&uuid=c645431935b42d6c9b024e8f10ca1c81: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

          2019-09-05T01:38:17.265-07:00 [INFO] feed_dcp: OnError, name: bucket-3-fts_less_words_7550c79ad00c55c1_18572d87: bucketName: bucket-3, bucketUUID: , err: worker auth, AuthenticateMemcachedConn, server: 172.23.97.15:11210, err: Unable to find given hostport in cbauth database: `172.23.97.15:11210'
          2019-09-05T01:38:17.266-07:00 [INFO] main: meh.OnFeedError, srcType: couchbase, err: worker auth, AuthenticateMemcachedConn, server: 172.23.97.15:11210, err: Unable to find given hostport in cbauth database: `172.23.97.15:11210'
          2019-09-05T01:38:17.266-07:00 [INFO] feed_dcp: OnError, name: bucket-18-fts_more_words_26b92664c7b3842c_aa574717: bucketName: bucket-18, bucketUUID: , err: worker auth, AuthenticateMemcachedConn, server: 172.23.97.15:11210, err: Unable to find given hostport in cbauth database: `172.23.97.15:11210'

           

          Abhinav Dangeti did you get a chance to look at those or any theories?

          Perhaps, we can have a bail out option from the getMovingPartitionCount() based on some signal channels/finite retry attempts etc, but then these sort of errors mostly points some other major wrinkles in the cluster health/sanity?

           

          Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited Just trying to bring the issue summary to the ticket, GetCurrentTopology hangs due to the blocked WaitGetTopology. And that blocks due to the lock deprivation here -  https://github.com/couchbase/cbgt/blob/master/ctl/ctl.go#L492 .  And this lock is blocked/held with the getMovingPartitionsCount() call which got triggered from StartTopologyChange(). getMovingPartitionsCount() is failing at cbgt.CouchbasePartitions at  https://github.com/couchbase/cbgt/blob/master/ctl/ctl.go#L209  while holding the lock. And this is evident from the fts traces :  2019-09-05T01:28:36.800-07:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091 , poolName: default, bucketName: bucket-9, sourceParams: "{}", err: Get http://127.0.0.1:8091/pools/default/buckets?v=1531394&uuid=c645431935b42d6c9b024e8f10ca1c81: net/http: request canceled (Client.Timeout exceeded while awaiting headers) It happens 24 times between min 28 and 38, ie in that 10 minutes window mentioned in early comments. If we look into the FTS logs, there are many endpoint timeout/cbauth db errors like below. eg: 2019-09-05T01:28:36.801-07:00 [INFO] feed_dcp: OnError, name: bucket-10-fts_more_words_30d403f3d2c087ec_aa574717: bucketName: bucket-10, bucketUUID: , err: Get http://127.0.0.1:8091/pools/default/buckets?v=1531394&uuid=c645431935b42d6c9b024e8f10ca1c81: net/http: request canceled (Client.Timeout exceeded while awaiting headers) 2019-09-05T01:38:17.265-07:00 [INFO] feed_dcp: OnError, name: bucket-3-fts_less_words_7550c79ad00c55c1_18572d87: bucketName: bucket-3, bucketUUID: , err: worker auth, AuthenticateMemcachedConn, server: 172.23.97.15:11210, err: Unable to find given hostport in cbauth database: `172.23.97.15:11210' 2019-09-05T01:38:17.266-07:00 [INFO] main: meh.OnFeedError, srcType: couchbase, err: worker auth, AuthenticateMemcachedConn, server: 172.23.97.15:11210, err: Unable to find given hostport in cbauth database: `172.23.97.15:11210' 2019-09-05T01:38:17.266-07:00 [INFO] feed_dcp: OnError, name: bucket-18-fts_more_words_26b92664c7b3842c_aa574717: bucketName: bucket-18, bucketUUID: , err: worker auth, AuthenticateMemcachedConn, server: 172.23.97.15:11210, err: Unable to find given hostport in cbauth database: `172.23.97.15:11210'   Abhinav Dangeti  did you get a chance to look at those or any theories? Perhaps, we can have a bail out option from the getMovingPartitionCount() based on some signal channels/finite retry attempts etc, but then these sort of errors mostly points some other major wrinkles in the cluster health/sanity?  

          Sreekanth Sivasankaran This is the line of thinking at the moment yes, that getMovingPartitionsCount(..) within Lock is what's very likely blocking WaitGetTopology(..)'s attempt to acquire the lock. The fact that none of the operations within getMovingPartitionsCount(..) even need the lock was the reason for this change .. http://review.couchbase.org/#/c/114759/ (which you've already approved I see ).

          I'm looking in to see if we can some how pass in a cancel channel to bail out of long running getMovingPartitionsCount(..).

           

          abhinav Abhinav Dangeti added a comment - Sreekanth Sivasankaran  This is the line of thinking at the moment yes, that getMovingPartitionsCount(..) within Lock is what's very likely blocking WaitGetTopology(..)'s attempt to acquire the lock. The fact that none of the operations within getMovingPartitionsCount(..) even need the lock was the reason for this change ..  http://review.couchbase.org/#/c/114759/  (which you've already approved I see ). I'm looking in to see if we can some how pass in a cancel channel to bail out of long running getMovingPartitionsCount(..).  

          Build couchbase-server-6.5.0-4306 contains cbgt commit b5f9985 with commit message:
          MB-34911: ctl.getMovingPartitionsCount(..) does not require mutex

          build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-4306 contains cbgt commit b5f9985 with commit message: MB-34911 : ctl.getMovingPartitionsCount(..) does not require mutex

          Build couchbase-server-6.5.0-4306 contains cbgt commit 06b5ab3 with commit message:
          MB-34911: Add timeout condition for CtlMgr's WaitGetTopology API

          build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-4306 contains cbgt commit 06b5ab3 with commit message: MB-34911 : Add timeout condition for CtlMgr's WaitGetTopology API

          "I'm looking in to see if we can some how pass in a cancel channel to bail out of long running getMovingPartitionsCount(..)."

          • this may not help the actual rebalance issue here- as doing this would conclude that no partitions to move and a successful rebalance operation where in reality those nodes were unreachable. So there could be non-zero moving partitions? Tracing the root cause for those endpoint/cbauth errors are important for the eventual fix.
          Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - "I'm looking in to see if we can some how pass in a cancel channel to bail out of long running getMovingPartitionsCount(..)." this may not help the actual rebalance issue here- as doing this would conclude that no partitions to move and a successful rebalance operation where in reality those nodes were unreachable. So there could be non-zero moving partitions? Tracing the root cause for those endpoint/cbauth errors are important for the eventual fix.

          The entry for the hostport: 172.23.97.15 was removed from the cbauth database when it was failed over and deactivated at about 2019-09-05T00:06:11.

          The logs show FTS attempting to reconnect to it until the logs were collected at about 2019-09-05T02:55:48. This is concerning and will need looking into, but does not seem to have anything to do with the rebalance failure due to the timeout in GetCurrentTopology. The fix http://review.couchbase.org/#/c/114759/ should ensure that there isn't lock contention anymore, and a cancellation attempt from ns_server should go into effect immediately.

          abhinav Abhinav Dangeti added a comment - The entry for the hostport: 172.23.97.15 was removed from the cbauth database when it was failed over and deactivated at about 2019-09-05T00:06:11. The logs show FTS attempting to reconnect to it until the logs were collected at about 2019-09-05T02:55:48. This is concerning and will need looking into, but does not seem to have anything to do with the rebalance failure due to the timeout in GetCurrentTopology. The fix  http://review.couchbase.org/#/c/114759/  should ensure that there isn't lock contention anymore, and a cancellation attempt from ns_server should go into effect immediately.
          abhinav Abhinav Dangeti added a comment - - edited

          Merged 2 fixes for this and don't have any other changes planned for this at the moment.

          Please re-test with latest build(s).

          abhinav Abhinav Dangeti added a comment - - edited Merged 2 fixes for this and don't have any other changes planned for this at the moment. Please re-test with latest build(s).
          abhinav Abhinav Dangeti made changes -
          Assignee Abhinav Dangeti [ abhinav ] Mahesh Mandhare [ mahesh.mandhare ]
          Resolution Fixed [ 1 ]
          Status Open [ 1 ] Resolved [ 5 ]
          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Abhinav Dangeti [ abhinav ]

          It seems like the timeout is still due to the API within FTS waiting to acquire a lock. The lock however is not the ctl/ctl’s lock but ctl/manager’s lock which the API tries ro acquire after the WaitGetTopology timeout is hit. This lock is held by longer running ctl/manager apis - StartTopologyChange in particular during rebalance.

          abhinav Abhinav Dangeti added a comment - It seems like the timeout is still due to the API within FTS waiting to acquire a lock. The lock however is not the ctl/ctl’s lock but ctl/manager’s lock which the API tries ro acquire after the WaitGetTopology timeout is hit. This lock is held by longer running ctl/manager apis - StartTopologyChange in particular during rebalance.

          Build couchbase-server-6.5.0-4522 contains cbgt commit fc5642c with commit message:
          MB-34911: Separate lock for ctl/manager's lastTopology

          build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-4522 contains cbgt commit fc5642c with commit message: MB-34911 : Separate lock for ctl/manager's lastTopology

          Build couchbase-server-6.5.0-4522 contains cbgt commit f997309 with commit message:
          MB-34911: WaitGetTopology - Compare uint64s as opposed to strings

          build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-4522 contains cbgt commit f997309 with commit message: MB-34911 : WaitGetTopology - Compare uint64s as opposed to strings
          abhinav Abhinav Dangeti made changes -
          Resolution Fixed [ 1 ]
          Status Reopened [ 4 ] Resolved [ 5 ]

          Build couchbase-server-7.0.0-1007 contains cbgt commit fc5642c with commit message:
          MB-34911: Separate lock for ctl/manager's lastTopology

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-1007 contains cbgt commit fc5642c with commit message: MB-34911 : Separate lock for ctl/manager's lastTopology

          Build couchbase-server-7.0.0-1007 contains cbgt commit f997309 with commit message:
          MB-34911: WaitGetTopology - Compare uint64s as opposed to strings

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-1007 contains cbgt commit f997309 with commit message: MB-34911 : WaitGetTopology - Compare uint64s as opposed to strings
          wayne Wayne Siu made changes -
          Assignee Abhinav Dangeti [ abhinav ] Mahesh Mandhare [ mahesh.mandhare ]

          Build sync_gateway-2.7.0-78 contains cbgt commit fc5642c with commit message:
          MB-34911: Separate lock for ctl/manager's lastTopology

          build-team Couchbase Build Team added a comment - Build sync_gateway-2.7.0-78 contains cbgt commit fc5642c with commit message: MB-34911 : Separate lock for ctl/manager's lastTopology

          Build sync_gateway-2.7.0-78 contains cbgt commit f997309 with commit message:
          MB-34911: WaitGetTopology - Compare uint64s as opposed to strings

          build-team Couchbase Build Team added a comment - Build sync_gateway-2.7.0-78 contains cbgt commit f997309 with commit message: MB-34911 : WaitGetTopology - Compare uint64s as opposed to strings
          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment - - edited Build 6.5.0-4744 Observed similar error and rebalance failed after hard failover of kv node. Error- Rebalance exited with reason {service_rebalance_failed,fts, {agent_died,<25611.3855.0>, {linked_process_died,<25611.20673.216>, {timeout, {gen_server,call, [<25611.3970.0>, {call,"ServiceAPI.StartTopologyChange", #Fun<json_rpc_connection.0.102434519>}, 60000]}}}}}. Rebalance Operation Id = 2218fa8964b3abb2c27f8afdd7765c66 Logs- FTS node-  https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_reb_failed/collectinfo-2019-11-14T121550-ns_1%40172.23.96.20.zip Other nodes-  https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_reb_failed/collectinfo-2019-11-14T121550-ns_1%40172.23.96.23.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_reb_failed/collectinfo-2019-11-14T121550-ns_1%40172.23.97.12.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_reb_failed/collectinfo-2019-11-14T121550-ns_1%40172.23.97.13.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_reb_failed/collectinfo-2019-11-14T121550-ns_1%40172.23.97.177.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_reb_failed/collectinfo-2019-11-14T121550-ns_1%40172.23.97.19.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/fts_reb_failed/collectinfo-2019-11-14T121550-ns_1%40172.23.97.20.zip  
          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Abhinav Dangeti [ abhinav ]
          abhinav Abhinav Dangeti added a comment - - edited

          Hmm, looks like StartTopologyChange is taking too long ..

          2019-11-14T03:44:29.294-08:00 [INFO] ctl/manager: StartTopologyChange, change: {67fa3e10e7dc684329cc925064dcc4e7 [] topology-change-rebalance [{{8a65072072f2698215b6624168b91871 0 <nil>} recovery-full}] []}
          ...
          2019-11-14T03:54:48.175-08:00 [INFO] ctl/manager: StartTopologyChange, started 

          (over 10 min)

          Believe it's waiting to acquire one of the locks along its code path. Let me check the code path again.

          Note that this timeout is different from the previous timeouts seen which had to do with GetCurrentTopology and GetTaskList.

          abhinav Abhinav Dangeti added a comment - - edited Hmm, looks like StartTopologyChange is taking too long .. 2019 - 11 -14T03: 44 : 29.294 - 08 : 00 [INFO] ctl/manager: StartTopologyChange, change: {67fa3e10e7dc684329cc925064dcc4e7 [] topology-change-rebalance [{{8a65072072f2698215b6624168b91871 0 <nil>} recovery-full}] []} ... 2019 - 11 -14T03: 54 : 48.175 - 08 : 00 [INFO] ctl/manager: StartTopologyChange, started (over 10 min) Believe it's waiting to acquire one of the locks along its code path. Let me check the code path again. Note that this timeout is different from the previous timeouts seen which had to do with GetCurrentTopology and GetTaskList.
          abhinav Abhinav Dangeti added a comment - - edited

          So doesn't appear to be lock contention. getMovingPartitionsCount which StartTopologyChange does invoke iterates over all 22 buckets (for 10 min), and each of these are err-ing while setting up a CouchbaseBucket instance via go-couchbase. Here're the errors ..

          2019-11-14T03:45:12.175-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-15, sourceParams: "{}", err: Get http://127.0.0.1:8091/pools/default/buckets?v=42730975&uuid=4adff6cb6cbb3f86e7f025cabe97e2e0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
          2019-11-14T03:45:52.037-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-24, sourceParams: "{}", err: Get http://127.0.0.1:8091/pools/default/buckets?v=42730975&uuid=4adff6cb6cbb3f86e7f025cabe97e2e0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
          2019-11-14T03:46:35.882-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-1, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:46:59.582-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-22, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:47:22.354-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-25, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:47:34.926-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-3, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:47:52.973-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-6, sourceParams: "{}", err: Get http://127.0.0.1:8091/pools/default/buckets?v=42730975&uuid=4adff6cb6cbb3f86e7f025cabe97e2e0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
          2019-11-14T03:48:05.881-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-9, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:48:29.018-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-19, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:49:15.248-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-27, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:49:53.907-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-29, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:50:09.842-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-10, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:50:45.301-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-26, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:51:11.619-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-8, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:52:07.535-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-21, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:52:19.152-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-17, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:52:50.972-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-30, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:53:31.396-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-13, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:53:56.495-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-14, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:54:07.120-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-12, sourceParams: "{}", err: Get http://127.0.0.1:8091/pools/default/buckets?v=42730975&uuid=4adff6cb6cbb3f86e7f025cabe97e2e0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
          2019-11-14T03:54:18.852-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-18, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:54:48.175-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-7, sourceParams: "{}", err: invalid character '<' looking for beginning of value
          2019-11-14T03:54:48.175-08:00 [INFO] ctl: getMovingPartitionsCount: 0 

          It seems like this is a bug in go-couchbase where refresh is invoked on the pool with its URI not set correctly.

          Sreekanth Sivasankaran made this change: http://review.couchbase.org/#/c/117634, which is relevant here.

          abhinav Abhinav Dangeti added a comment - - edited So doesn't appear to be lock contention. getMovingPartitionsCount which StartTopologyChange does invoke iterates over all 22 buckets (for 10 min) , and each of these are err-ing while setting up a CouchbaseBucket instance via go-couchbase. Here're the errors .. 2019 - 11 -14T03: 45 : 12.175 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-15, sourceParams: "{}", err: Get http://127.0.0.1:8091/pools/default/buckets?v=42730975&uuid=4adff6cb6cbb3f86e7f025cabe97e2e0: net/http: request canceled (Client.Timeout exceeded while awaiting headers) 2019 - 11 -14T03: 45 : 52.037 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-24, sourceParams: "{}", err: Get http://127.0.0.1:8091/pools/default/buckets?v=42730975&uuid=4adff6cb6cbb3f86e7f025cabe97e2e0: net/http: request canceled (Client.Timeout exceeded while awaiting headers) 2019 - 11 -14T03: 46 : 35.882 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-1, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 46 : 59.582 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-22, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 47 : 22.354 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-25, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 47 : 34.926 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-3, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 47 : 52.973 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-6, sourceParams: "{}", err: Get http://127.0.0.1:8091/pools/default/buckets?v=42730975&uuid=4adff6cb6cbb3f86e7f025cabe97e2e0: net/http: request canceled (Client.Timeout exceeded while awaiting headers) 2019 - 11 -14T03: 48 : 05.881 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-9, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 48 : 29.018 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-19, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 49 : 15.248 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-27, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 49 : 53.907 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-29, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 50 : 09.842 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-10, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 50 : 45.301 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-26, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 51 : 11.619 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-8, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 52 : 07.535 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-21, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 52 : 19.152 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-17, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 52 : 50.972 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-30, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 53 : 31.396 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-13, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 53 : 56.495 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-14, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 54 : 07.120 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-12, sourceParams: "{}", err: Get http://127.0.0.1:8091/pools/default/buckets?v=42730975&uuid=4adff6cb6cbb3f86e7f025cabe97e2e0: net/http: request canceled (Client.Timeout exceeded while awaiting headers) 2019 - 11 -14T03: 54 : 18.852 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-18, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 54 : 48.175 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-7, sourceParams: "{}", err: invalid character '<' looking for beginning of value 2019 - 11 -14T03: 54 : 48.175 - 08 : 00 [INFO] ctl: getMovingPartitionsCount: 0 It seems like this is a bug in go-couchbase where refresh is invoked on the pool with its URI not set correctly. Sreekanth Sivasankaran  made this change:  http://review.couchbase.org/#/c/117634 , which is relevant here.

          This new issue is now related to .. MB-34931.

          abhinav Abhinav Dangeti added a comment - This new issue is now related to ..  MB-34931 .

          When parseURLResponse within Client's GetPool(..) times out and we don't do anything about it (INCORRECTLY), the Pool's BucketURL map is empty => when parseURLResponse is invoked again on an empty "path", we'd get a HTML dump which the json decoder fails to recognize.

          This here change within go-couchbase addresses that .. http://review.couchbase.org/#/c/117634 (merged).

          As for the amount of time spent within the getMovingPartitionsCount API within cbgt to iterate over all the index definitions/buckets (--> 10s per index definition which is the timeout setting for the HTTP client), which is causing StartTopologyChange API to timeout (60s), here's a fix that would avoid waiting for multiple requests to timeout and that returns when the first timeout is seen .. http://review.couchbase.org/#/c/117921/

          abhinav Abhinav Dangeti added a comment - When parseURLResponse within Client's GetPool(..) times out and we don't do anything about it (INCORRECTLY), the Pool's BucketURL map is empty => when parseURLResponse is invoked again on an empty "path", we'd get a HTML dump which the json decoder fails to recognize. This here change within go-couchbase addresses that .. http://review.couchbase.org/#/c/117634  (merged). As for the amount of time spent within the getMovingPartitionsCount API within cbgt to iterate over all the index definitions/buckets (--> 10s per index definition which is the timeout setting for the HTTP client), which is causing StartTopologyChange API to timeout (60s), here's a fix that would avoid waiting for multiple requests to timeout and that returns when the first timeout is seen ..  http://review.couchbase.org/#/c/117921/
          Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited

          hey Abhinav Dangeti/ Steve Yen, my thinking here is that - all these fixes are not going to fix the rebalance or make it work. Rather than ns_server erring out, we will error out saying those  "pool/endpoints" or relevant endpoints aren't working for FTS to complete the rebalance.

          This is one area which I was trying to bring up in the scrum call that we can't just make these rebalance codepaths entirely resilient/robust to external call outcomes.  (referring to the resiliency of the rebalance operation as a whole not any specific API calls)

          For the system rebalance to reach such a stable stage, all the external systems/services need to behave predictably/consistently. If not, we will end up failing rebalances for the user. And for the customer, it doesn't matter whether StartTopology timed out or an FTS GetPool call.

          From the high density test outcomes so far, with these endpoints(eg: pools/blah/blah/) kept failing erratically its not fair to conclude yet that we support 30 buckets/high bucket density even from a CB perspective.

           

          Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited hey  Abhinav Dangeti / Steve Yen , my thinking here is that - all these fixes are not going to fix the rebalance or make it work. Rather than ns_server erring out, we will error out saying those  "pool/endpoints" or relevant endpoints aren't working for FTS to complete the rebalance. This is one area which I was trying to bring up in the scrum call that we can't just make these rebalance codepaths entirely resilient/robust to external call outcomes.  (referring to the resiliency of the rebalance operation as a whole not any specific API calls) For the system rebalance to reach such a stable stage, all the external systems/services need to behave predictably/consistently. If not, we will end up failing rebalances for the user. And for the customer, it doesn't matter whether StartTopology timed out or an FTS GetPool call. From the high density test outcomes so far, with these endpoints(eg: pools/blah/blah/) kept failing erratically its not fair to conclude yet that we support 30 buckets/high bucket density even from a CB perspective.  

          What is the timeout that FTS uses for these requests?

          Aliaksey Artamonau Aliaksey Artamonau (Inactive) added a comment - What is the timeout that FTS uses for these requests?

          10 seconds

          abhinav Abhinav Dangeti added a comment - 10 seconds
          Sreekanth Sivasankaran Sreekanth Sivasankaran made changes -
          Assignee Abhinav Dangeti [ abhinav ] Mahesh Mandhare [ mahesh.mandhare ]
          Sreekanth Sivasankaran Sreekanth Sivasankaran made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Abhinav Dangeti [ abhinav ]

          Here's an update on the change .. http://review.couchbase.org/#/c/117921/2

          This change is to error out early in case a request timeout is seen, rather than attempt sending more requests in (which could also potentially timeout) in which case we would be updating the moving partitions count incorrectly anyway.

          Note that with this change - rebalance would likely still fail in the scenario above - instead of StartTopologyChange timing out as ns_server didn't hear back from FTS within 60s, it would show an error saying that the preparation phase of StartTopologyChange failed because of a request timeout to ns_server from FTS. A retry attempt on the rebalance will pick up things where we last left off and could potentially succeed iff ns_server responds to FTS requests before they're cancelled due to timeout (10s).

          2019-11-14T03:45:12.175-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-15, sourceParams: "{}", err: Get http://127.0.0.1:8091/pools/default/buckets?v=42730975&uuid=4adff6cb6cbb3f86e7f025cabe97e2e0: net/http: request canceled (Client.Timeout exceeded while awaiting headers) 

          However a retest is advised even before merging the change, with the latest mad-hatter build that includes (http://review.couchbase.org/#/c/117634 + http://review.couchbase.org/#/c/118041/) which address the issue where FTS sends numerous incorrect requests to ns_server which result in these log messages ..

          2019-11-14T03:46:35.882-08:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http://127.0.0.1:8091, poolName: default, bucketName: bucket-1, sourceParams: "{}", err: invalid character '<' looking for beginning of value

          Ping Mahesh Mandhare.

          abhinav Abhinav Dangeti added a comment - Here's an update on the change ..  http://review.couchbase.org/#/c/117921/2 This change is to error out early in case a request timeout is seen, rather than attempt sending more requests in (which could also potentially timeout) in which case we would be updating the moving partitions count incorrectly anyway. Note that with this change - rebalance would likely still fail in the scenario above - instead of StartTopologyChange timing out as ns_server didn't hear back from FTS within 60s, it would show an error saying that the preparation phase of StartTopologyChange failed because of a request timeout to ns_server from FTS. A retry attempt on the rebalance will pick up things where we last left off and could potentially succeed iff ns_server responds to FTS requests before they're cancelled due to timeout (10s). 2019 - 11 -14T03: 45 : 12.175 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-15, sourceParams: "{}", err: Get http://127.0.0.1:8091/pools/default/buckets?v=42730975&uuid=4adff6cb6cbb3f86e7f025cabe97e2e0: net/http: request canceled (Client.Timeout exceeded while awaiting headers) However a retest is advised even before merging the change, with the latest mad-hatter build that includes ( http://review.couchbase.org/#/c/117634  +  http://review.couchbase.org/#/c/118041/ ) which address the issue where FTS sends numerous incorrect requests to ns_server which result in these log messages .. 2019 - 11 -14T03: 46 : 35.882 - 08 : 00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: gocouchbase_helper: CouchbaseBucket failed GetPool, server: http: //127.0.0.1:8091, poolName: default, bucketName: bucket-1, sourceParams: "{}", err: invalid character '<' looking for beginning of value Ping Mahesh Mandhare .
          lynn.straus Lynn Straus made changes -
          Fix Version/s Cheshire-Cat [ 15915 ]
          Fix Version/s Mad-Hatter [ 15037 ]
          lynn.straus Lynn Straus made changes -
          Labels high-bucket-density 6.5.x-candidate deferred-from-Mad-Hatter high-bucket-density
          abhinav Abhinav Dangeti made changes -
          Link This issue relates to MB-34931 [ MB-34931 ]
          abhinav Abhinav Dangeti made changes -
          Link This issue relates to MB-36400 [ MB-36400 ]
          wayne Wayne Siu made changes -
          Labels 6.5.x-candidate deferred-from-Mad-Hatter high-bucket-density deferred-from-Mad-Hatter high-bucket-density

          Jyotsna Nayak: can you please run the high bucket density test(with 30 buckets) again to see if this is reproducible? Thanks.

          keshav Keshav Murthy added a comment - Jyotsna Nayak : can you please run the high bucket density test(with 30 buckets) again to see if this is reproducible? Thanks.
          keshav Keshav Murthy made changes -
          Assignee Abhinav Dangeti [ abhinav ] Jyotsna Nayak [ jyotsna.nayak ]
          wayne Wayne Siu made changes -
          Summary FTS: Rebalance failed after failover in high bucket density test [high-bucket] FTS: Rebalance failed after failover in high bucket density test
          keshav Keshav Murthy made changes -
          Priority Critical [ 2 ] Major [ 3 ]
          keshav Keshav Murthy made changes -
          Fix Version/s 7.0.1 [ 17104 ]
          Fix Version/s Cheshire-Cat [ 15915 ]
          abhinav Abhinav Dangeti made changes -
          Fix Version/s Neo [ 17615 ]
          Fix Version/s 7.0.1 [ 17104 ]
          jyotsna.nayak Jyotsna Nayak added a comment - - edited

          Raju Suravarjjala , sorry for the delay. I had run the test two weeks back but got a mismatch error with was not captured in the cbcollect logs or the logs on the UI. I ran the test on 2 buckets for build 7.0.2 and 7.1.0. The test passed for both the builds with no issues.

          I have scheduled the test for 30 buckets (It is in the queue).
          Link : here

          jyotsna.nayak Jyotsna Nayak added a comment - - edited Raju Suravarjjala  , sorry for the delay. I had run the test two weeks back but got a mismatch error with was not captured in the cbcollect logs or the logs on the UI. I ran the test on 2 buckets for build 7.0.2 and 7.1.0. The test passed for both the builds with no issues. I have scheduled the test for 30 buckets (It is in the queue). Link : here

          Closing this issue as we are not seeing this on CC build: 7.0.0-4554.
          Swap rebalance completed. Time taken for swap rebalance : 468.9639 min (~7.81 hrs)
          Link to the job: http://perf.jenkins.couchbase.com/job/themis_multibucket/63/consoleFull

          jyotsna.nayak Jyotsna Nayak added a comment - Closing this issue as we are not seeing this on CC build: 7.0.0-4554. Swap rebalance completed. Time taken for swap rebalance : 468.9639 min (~7.81 hrs) Link to the job: http://perf.jenkins.couchbase.com/job/themis_multibucket/63/consoleFull
          jyotsna.nayak Jyotsna Nayak made changes -
          Resolution Fixed [ 1 ]
          Status Reopened [ 4 ] Closed [ 6 ]

          People

            jyotsna.nayak Jyotsna Nayak
            mahesh.mandhare Mahesh Mandhare (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              PagerDuty