Couchbase Server / MB-44728

[System Test] : Rebalance failing with (err=Collection Not Found)

Details

    Description

      Build : 7.0.0-4554
      Test : -test tests/2i/cheshirecat/test_idx_clusterops_cheshire_cat.yml -scope tests/2i/cheshirecat/scope_idx_cheshire_cat.yml
      Scale : 2
      Iteration : 4th

      This could be similar to MB-44627, but I don't see the symptoms described in that ticket. Also, this issue has been seen only once so far in a test that has been running for 40 hrs.

      Rebalance operation to remove indexer node 172.23.105.186 failed at 2021-03-03T04:55:06.

      [2021-03-03T04:34:26-08:00, sequoiatools/couchbase-cli:7.0:e78c8a] rebalance -c 172.23.106.253:8091 --server-remove 172.23.105.186 -u Administrator -p password
       
      Error occurred on container - sequoiatools/couchbase-cli:7.0:[rebalance -c 172.23.106.253:8091 --server-remove 172.23.105.186 -u Administrator -p password]
       
      docker logs e78c8a
      docker start e78c8a
       
      *Unable to display progress bar on this os
      ERROR: Rebalance failed. See logs for detailed reason. You can try again.
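
To put the failure in context, the time between the CLI invocation and the reported failure can be worked out from the two timestamps in this report. This is only a rough sketch: the failure time is quoted without a UTC offset in the description, so `-08:00` is assumed to match the log lines, and the CLI timestamp marks when the container command was launched rather than when ns_server actually began rebalancing.

```python
from datetime import datetime

# Timestamps copied from this report. The failure time appears without a
# UTC offset in the description; -08:00 is an assumption matching the logs.
started = datetime.fromisoformat("2021-03-03T04:34:26-08:00")
failed = datetime.fromisoformat("2021-03-03T04:55:06-08:00")

elapsed = failed - started
elapsed_seconds = int(elapsed.total_seconds())
minutes, seconds = divmod(elapsed_seconds, 60)
print(f"rebalance ran for about {minutes} min {seconds} s before failing")
```

Under these assumptions the rebalance ran for roughly 20 minutes before the 60000 ms `ServiceAPI.GetCurrentTopology` call timed out, so the timeout was not hit on the first topology poll after the command was issued.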
      

      The error on the master node (172.23.106.253) is:

      [ns_server:error,2021-03-03T04:55:06.461-08:00,ns_1@172.23.106.253:service_rebalancer-index<0.6629.933>:service_rebalancer:run_rebalance_worker:136]Agent terminated during the rebalance: {'DOWN',
                                              #Ref<0.3071195499.1647837191.242955>,
                                              process,<29971.3312.233>,
                                              {linked_process_died,
                                               <29971.4242.282>,
                                               {timeout,
                                                {gen_server,call,
                                                 [<29971.4413.233>,
                                                  {call,
                                                   "ServiceAPI.GetCurrentTopology",
                                                   #Fun<json_rpc_connection.0.44122352>},
                                                  60000]}}}}
      [user:error,2021-03-03T04:55:06.464-08:00,ns_1@172.23.106.253:<0.9291.0>:ns_orchestrator:log_rebalance_completion:1407]Rebalance exited with reason {service_rebalance_failed,index,
                                    {agent_died,<29971.3312.233>,
                                     {linked_process_died,<29971.4242.282>,
                                      {timeout,
                                       {gen_server,call,
                                        [<29971.4413.233>,
                                         {call,"ServiceAPI.GetCurrentTopology",
                                          #Fun<json_rpc_connection.0.44122352>},
                                         60000]}}}}}.
      Rebalance Operation Id = 1530e97a9762d95dd9e3aa32d540c851
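
When scanning many ns_server logs for the same failure, the timed-out RPC and its timeout value can be extracted from excerpts like the one above. The following is a minimal sketch that assumes the Erlang term layout shown in this report; `find_rpc_timeouts` and `TIMEOUT_RE` are hypothetical names, not part of any Couchbase tooling.

```python
import re

# Matches the timeout tuple as printed in the ns_server logs in this report:
# {timeout,{gen_server,call,[<pid>,{call,"Method",#Fun<...>},60000]}}
# Whitespace and line breaks between the elements are tolerated.
TIMEOUT_RE = re.compile(
    r'\{timeout,\s*\{gen_server,call,\s*\[(<[\d.]+>),\s*'
    r'\{call,\s*"([^"]+)",\s*#Fun<[^>]+>\},\s*(\d+)\]\}\}'
)

def find_rpc_timeouts(log_text):
    """Return (pid, rpc_method, timeout_ms) for each gen_server call timeout."""
    return [
        (pid, method, int(ms))
        for pid, method, ms in TIMEOUT_RE.findall(log_text)
    ]

# Excerpt copied from the master node error in this report.
sample = '''{linked_process_died,
 <29971.4242.282>,
 {timeout,
  {gen_server,call,
   [<29971.4413.233>,
    {call,
     "ServiceAPI.GetCurrentTopology",
     #Fun<json_rpc_connection.0.44122352>},
    60000]}}}'''

for pid, method, timeout_ms in find_rpc_timeouts(sample):
    print(pid, method, timeout_ms)
```

Run against the excerpt, this reports the json rpc connection pid, the `ServiceAPI.GetCurrentTopology` method, and the 60000 ms timeout, which makes it easier to confirm whether other nodes hit the same call.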
      

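      After checking all the indexer nodes, node 172.23.106.255 appears to be where the problem occurred: its service_agent long-poll worker timed out after 60000 ms on a ServiceAPI.GetCurrentTopology call. The following is from the debug.log of 172.23.106.255: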

      [error_logger:error,2021-03-03T04:55:06.457-08:00,ns_1@172.23.106.255:<0.4242.282>:ale_error_logger_handler:do_log:107]
      =========================CRASH REPORT=========================
        crasher:
          initial call: service_agent:'-start_long_poll_worker/4-fun-0-'/0
          pid: <0.4242.282>
          registered_name: []
          exception exit: {timeout,
                              {gen_server,call,
                                  [<0.4413.233>,
                                   {call,"ServiceAPI.GetCurrentTopology",
                                       #Fun<json_rpc_connection.0.44122352>},
                                   60000]}}
            in function  gen_server:call/3 (gen_server.erl, line 223)
            in call from service_api:perform_call/3 (src/service_api.erl, line 55)
            in call from service_agent:grab_topology/2 (src/service_agent.erl, line 590)
            in call from service_agent:long_poll_worker_loop/5 (src/service_agent.erl, line 655)
          ancestors: ['service_agent-index',service_agent_children_sup,
                        service_agent_sup,ns_server_sup,ns_server_nodes_sup,
                        <0.7883.0>,ns_server_cluster_sup,root_sup,<0.138.0>]
          message_queue_len: 0
          messages: []
          links: [<0.3312.233>]
          dictionary: []
          trap_exit: false
          status: running
          heap_size: 1598
          stack_size: 27
          reductions: 20262
        neighbours:
       
      [ns_server:error,2021-03-03T04:55:06.458-08:00,ns_1@172.23.106.255:service_agent-index<0.3312.233>:service_agent:handle_info:283]Linked process <0.4242.282> died with reason {timeout,
                                                    {gen_server,call,
                                                     [<0.4413.233>,
                                                      {call,
                                                       "ServiceAPI.GetCurrentTopology",
                                                       #Fun<json_rpc_connection.0.44122352>},
                                                      60000]}}. Terminating
      [ns_server:error,2021-03-03T04:55:06.458-08:00,ns_1@172.23.106.255:service_agent-index<0.3312.233>:service_agent:terminate:312]Terminating abnormally
      [ns_server:error,2021-03-03T04:55:06.458-08:00,ns_1@172.23.106.255:service_agent-index<0.3312.233>:service_agent:terminate:317]Terminating json rpc connection for index: <0.4413.233>
      [error_logger:error,2021-03-03T04:55:06.458-08:00,ns_1@172.23.106.255:service_agent-index<0.3312.233>:ale_error_logger_handler:do_log:107]
      =========================ERROR REPORT=========================
      ** Generic server 'service_agent-index' terminating
      ** Last message in was {'EXIT',<0.4242.282>,
                              {timeout,
                               {gen_server,call,
                                [<0.4413.233>,
                                 {call,"ServiceAPI.GetCurrentTopology",
                                  #Fun<json_rpc_connection.0.44122352>},
                                 60000]}}}
      ** When Server state == {state,index,
                               {dict,24,16,16,8,80,48,
                                {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                                {{[[{uuid,<<"e2c2bfa8d14931a5560f871b2a042546">>}|
                                    'ns_1@172.23.107.89'],
                                   [{uuid,<<"ad14f6d539c028084e7b002f6d3deacf">>}|
                                    'ns_1@172.23.97.213']],
                                  [],[],
                                  [[{node,'ns_1@172.23.97.214'}|
                                    <<"8c6944b13aca3fd1c0cbf52fd269eeef">>],
                                   [{node,'ns_1@172.23.106.154'}|
                                    <<"f803a0f259225e0f1ead3adc0f7b9e49">>]],
                                  [],
                                  [[{uuid,<<"b9e888e6147fea51b3e5f081cbb1a64e">>}|
                                    'ns_1@172.23.105.185']],
                                  [[{uuid,<<"a3641e2d6b8632676af5030ad2000433">>}|
                                    'ns_1@172.23.106.242']],
                                  [[{uuid,<<"3f7f791a1133d2219c15b421fd081794">>}|
                                    'ns_1@172.23.106.243'],
                                   [{uuid,<<"5797bb88a535aea59fa01ffdba22a0b6">>}|
                                    'ns_1@172.23.106.255'],
                                  ...
                                  ...
                                  ...
      ** Reason for termination ==
      ** {linked_process_died,<0.4242.282>,
             {timeout,
                 {gen_server,call,
                     [<0.4413.233>,
                      {call,"ServiceAPI.GetCurrentTopology",
                          #Fun<json_rpc_connection.0.44122352>},
                      60000]}}}
      

          People

            kevin.cherkauer Kevin Cherkauer (Inactive)
            mihir.kamdar Mihir Kamdar (Inactive)