Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-46317

[Volume Test] KV rebalance out failed due to ServiceAPI.GetTaskList timeout

    XMLWordPrintable

Details

    Description

      Build : 7.0.0-5161
      Test : -test tests/2i/cheshirecat/test_idx_cc_vol_10K.yml -scope tests/2i/cheshirecat/scope_idx_cc_vol_10K.yml
      Scale : 5
      Iteration : 1st

      Rebalance out of KV node 172.23.96.14 failed due to the following error :

      [ns_server:error,2021-05-14T21:28:08.652-07:00,ns_1@172.23.120.58:service_rebalancer-index<0.23744.459>:service_rebalancer:run_rebalance_worker:130]Agent terminated during the rebalance: {'DOWN',
                                              #Ref<0.772317819.1189609476.126010>,
                                              process,<29859.890.6>,
                                              {linked_process_died,
                                               <29859.14034.446>,
                                               {'ns_1@172.23.123.31',
                                                {timeout,
                                                 {gen_server,call,
                                                  [<29859.1100.6>,
                                                   {call,"ServiceAPI.GetTaskList",
                                                    #Fun<json_rpc_connection.0.77329884>},
                                                   60000]}}}}}
      [ns_server:error,2021-05-14T21:28:08.654-07:00,ns_1@172.23.120.58:service_agent-index<0.30446.446>:service_agent:handle_info:281]Rebalancer <0.23744.459> died unexpectedly: {agent_died,<29859.890.6>,
                                                   {linked_process_died,
                                                    <29859.14034.446>,
                                                    {'ns_1@172.23.123.31',
                                                     {timeout,
                                                      {gen_server,call,
                                                       [<29859.1100.6>,
                                                        {call,
                                                         "ServiceAPI.GetTaskList",
                                                         #Fun<json_rpc_connection.0.77329884>},
                                                        60000]}}}}}
      [user:error,2021-05-14T21:28:08.834-07:00,ns_1@172.23.120.58:<0.29127.446>:ns_orchestrator:log_rebalance_completion:1416]Rebalance exited with reason {service_rebalance_failed,index,
                                    {agent_died,<29859.890.6>,
                                     {linked_process_died,<29859.14034.446>,
                                      {'ns_1@172.23.123.31',
                                       {timeout,
                                        {gen_server,call,
                                         [<29859.1100.6>,
                                          {call,"ServiceAPI.GetTaskList",
                                           #Fun<json_rpc_connection.0.77329884>},
                                          60000]}}}}}}.
      Rebalance Operation Id = a21ff0e40e2a32d9787d17966efd45ca
      

      On 172.23.123.31, following is seen in the debug logs :

      [ns_server:warn,2021-05-14T21:27:55.955-07:00,ns_1@172.23.123.31:ns_server_monitor<0.25192.462>:health_monitor:send_heartbeat_inner:157]ns_server_monitor send heartbeat timed out
       
      [ns_server:error,2021-05-14T21:28:00.212-07:00,ns_1@172.23.123.31:service_status_keeper_worker<0.27076.5>:rest_utils:get_json:57]Request to (indexer) getIndexStatus with headers [{"If-None-Match",
                                                         "4cec15fd7844af82"}] failed: {error,
                                                                                       timeout}
      [ns_server:error,2021-05-14T21:28:00.212-07:00,ns_1@172.23.123.31:service_status_keeper-index<0.27077.5>:service_status_keeper:handle_cast:103]Service service_index returned incorrect status
      [ns_server:debug,2021-05-14T21:28:00.213-07:00,ns_1@172.23.123.31:compaction_daemon<0.27091.5>:compaction_daemon:process_scheduler_message:1302]No buckets to compact for compact_kv. Rescheduling compaction.
      [ns_server:debug,2021-05-14T21:28:00.213-07:00,ns_1@172.23.123.31:compaction_daemon<0.27091.5>:compaction_scheduler:schedule_next:51]Finished compaction for compact_kv too soon. Next run will be in 30s
      [ns_server:debug,2021-05-14T21:28:02.284-07:00,ns_1@172.23.123.31:compaction_daemon<0.27091.5>:compaction_daemon:process_scheduler_message:1302]No buckets to compact for compact_views. Rescheduling compaction.
      [ns_server:debug,2021-05-14T21:28:02.284-07:00,ns_1@172.23.123.31:compaction_daemon<0.27091.5>:compaction_scheduler:schedule_next:51]Finished compaction for compact_views too soon. Next run will be in 30s
      [ns_server:debug,2021-05-14T21:28:02.896-07:00,ns_1@172.23.123.31:ldap_auth_cache<0.26693.5>:active_cache:cleanup:225]Cache ldap_auth_cache cleanup: 0/0 records deleted
      [ns_server:info,2021-05-14T21:28:05.013-07:00,ns_1@172.23.123.31:ns_config_rep<0.26830.5>:ns_config_rep:pull_one_node:377]Pulling config from: 'ns_1@172.23.97.241'
      [error_logger:error,2021-05-14T21:28:07.440-07:00,ns_1@172.23.123.31:<0.14034.446>:ale_error_logger_handler:do_log:101]
      =========================CRASH REPORT=========================
        crasher:
          initial call: service_agent:'-start_long_poll_worker/4-fun-0-'/0
          pid: <0.14034.446>
          registered_name: []
          exception exit: {timeout,
                              {gen_server,call,
                                  [<0.1100.6>,
                                   {call,"ServiceAPI.GetTaskList",
                                       #Fun<json_rpc_connection.0.77329884>},
                                   60000]}}
            in function  gen_server:call/3 (gen_server.erl, line 223)
            in call from service_api:perform_call/3 (src/service_api.erl, line 49)
            in call from service_agent:grab_tasks/2 (src/service_agent.erl, line 568)
            in call from service_agent:long_poll_worker_loop/5 (src/service_agent.erl, line 649)
          ancestors: ['service_agent-index',service_agent_children_sup,
                        service_agent_sup,ns_server_sup,ns_server_nodes_sup,
                        <0.26637.5>,ns_server_cluster_sup,root_sup,<0.139.0>]
          message_queue_len: 0
          messages: []
          links: [<0.890.6>]
          dictionary: []
          trap_exit: false
          status: running
          heap_size: 1598
          stack_size: 27
          reductions: 10982
        neighbours:
       
      [ns_server:error,2021-05-14T21:28:07.440-07:00,ns_1@172.23.123.31:service_agent-index<0.890.6>:service_agent:handle_info:277]Linked process <0.14034.446> died with reason {timeout,
                                                     {gen_server,call,
                                                      [<0.1100.6>,
                                                       {call,
                                                        "ServiceAPI.GetTaskList",
                                                        #Fun<json_rpc_connection.0.77329884>},
                                                       60000]}}. Terminating
      [chronicle:debug,2021-05-14T21:28:08.615-07:00,ns_1@172.23.123.31:chronicle_leader<0.26593.5>:chronicle_leader:handle_state_timeout:562]State timeout when state is: {follower,'ns_1@172.23.97.74',
                                       <<"9fdde511ed879690b2e04aba53cba39c">>,
                                       {4,'ns_1@172.23.97.74'},
                                       established}
      [ns_server:error,2021-05-14T21:28:08.647-07:00,ns_1@172.23.123.31:service_agent-index<0.890.6>:service_agent:terminate:306]Terminating abnormally
      [ns_server:error,2021-05-14T21:28:08.648-07:00,ns_1@172.23.123.31:service_agent-index<0.890.6>:service_agent:terminate:311]Terminating json rpc connection for index: <0.1100.6>
      [error_logger:error,2021-05-14T21:28:08.649-07:00,ns_1@172.23.123.31:service_agent-index<0.890.6>:ale_error_logger_handler:do_log:101]
      =========================ERROR REPORT=========================
      ** Generic server 'service_agent-index' terminating
      ** Last message in was {'EXIT',<0.14034.446>,
                              {timeout,
                               {gen_server,call,
                                [<0.1100.6>,
                                 {call,"ServiceAPI.GetTaskList",
                                  #Fun<json_rpc_connection.0.77329884>},
                                 60000]}}}
      ** When Server state == {state,index,
                               {dict,60,16,16,8,80,48,
      ...
      ...
      [error_logger:error,2021-05-14T21:28:08.652-07:00,ns_1@172.23.123.31:service_agent-index<0.890.6>:ale_error_logger_handler:do_log:101]
      =========================CRASH REPORT=========================
        crasher:
          initial call: service_agent:init/1
          pid: <0.890.6>
          registered_name: 'service_agent-index'
          exception exit: {linked_process_died,<0.14034.446>,
                           {'ns_1@172.23.123.31',
                            {timeout,
                             {gen_server,call,
                              [<0.1100.6>,
                               {call,"ServiceAPI.GetTaskList",
                                #Fun<json_rpc_connection.0.77329884>},
                               60000]}}}}
            in function  gen_server:handle_common_reply/8 (gen_server.erl, line 751)
          ancestors: [service_agent_children_sup,service_agent_sup,ns_server_sup,
                        ns_server_nodes_sup,<0.26637.5>,ns_server_cluster_sup,
                        root_sup,<0.139.0>]
          message_queue_len: 3
      

      Nodes with index service : 172.23.106.136, 172.23.120.58, 172.23.120.74, 172.23.120.75, 172.23.120.77, 172.23.120.81, 172.23.120.86, 172.23.123.26, 172.23.123.31, 172.23.123.32, 172.23.123.33, 172.23.96.243, 172.23.96.254, 172.23.97.105, 172.23.97.110, 172.23.97.112

      Nodes with kv service : 172.23.120.73, 172.23.121.77, 172.23.123.24, 172.23.96.122, 172.23.96.14, 172.23.96.48, 172.23.97.241, 172.23.97.74

       

      [Kevin Cherkauer  added: Earlier logs for node 123.31 from same test linked in MB-46318:

      Collected at around 2021-05-14 9 PM : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1621051816/collectinfo-2021-05-15T041026-ns_1%40172.23.123.31.zip
      Collected at around 2021-05-14 7 PM : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1621044363/collectinfo-2021-05-15T020611-ns_1%40172.23.123.31.zip

      ]

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            kevin.cherkauer Kevin Cherkauer (Inactive)
            mihir.kamdar Mihir Kamdar (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty