Couchbase Server - MB-44576

[System Test] : Rebalance failing because of linked_process_died ServiceAPI.GetCurrentTopology for indexer

    Description

      Build : 7.0.0-4539
      Test : -test tests/integration/cheshirecat/test_cheshirecat_kv_gsi_coll_xdcr_backup_sgw_fts_itemct_txns_eventing_cbas.yml -scope tests/integration/cheshirecat/scope_cheshirecat_with_backup.yml
      Scale : 2
      Iteration : 1st

      The longevity test started at 2021-02-24T16:11. After creating buckets, scopes, and collections, and doing some initial setup, there is a rebalance operation to remove a data node. This rebalance started at 2021-02-24T16:35:48. At 2021-02-24T16:38, it failed with:

      [ns_server:error,2021-02-24T16:38:44.545-08:00,ns_1@172.23.97.74:service_rebalancer-index<0.16911.24>:service_rebalancer:run_rebalance_worker:136]Agent terminated during the rebalance: {'DOWN',
                                              #Ref<0.2149394143.2317090822.16574>,
                                              process,<29982.26924.0>,
                                              {linked_process_died,<29982.27226.0>,
                                               {timeout,
                                                {gen_server,call,
                                                 [<29982.27014.0>,
                                                  {call,
                                                   "ServiceAPI.GetCurrentTopology",
                                                   #Fun<json_rpc_connection.0.44122352>},
                                                  60000]}}}}
      [user:error,2021-02-24T16:38:44.548-08:00,ns_1@172.23.97.74:<0.23064.0>:ns_orchestrator:log_rebalance_completion:1407]Rebalance exited with reason {service_rebalance_failed,index,
                                    {agent_died,<29982.26924.0>,
                                     {linked_process_died,<29982.27226.0>,
                                      {timeout,
                                       {gen_server,call,
                                        [<29982.27014.0>,
                                         {call,"ServiceAPI.GetCurrentTopology",
                                          #Fun<json_rpc_connection.0.44122352>},
                                         60000]}}}}}.
      Rebalance Operation Id = 4175f148c90a6f4bb009238fdc1d2e1f
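
      The exit reason above is the standard OTP shape for a gen_server:call/3 that gets no reply within its timeout (60000 ms here): the long-poll worker issuing the call dies, and because it is linked to the service agent, the agent dies with it, which ns_server then reports as agent_died / linked_process_died. A minimal sketch of that propagation, using hypothetical module and message names rather than the actual ns_server code:

      %% timeout_sketch.erl -- hypothetical names; mirrors the failure shape above.
      -module(timeout_sketch).
      -behaviour(gen_server).
      -export([run/0, init/1, handle_call/3, handle_cast/2]).

      init([]) -> {ok, #{}}.

      %% Never replies, like a ServiceAPI.GetCurrentTopology call that hangs.
      handle_call(get_topology, _From, State) ->
          timer:sleep(infinity),
          {noreply, State}.

      handle_cast(_Msg, State) -> {noreply, State}.

      run() ->
          process_flag(trap_exit, true),
          {ok, Server} = gen_server:start_link(?MODULE, [], []),
          %% The linked worker's call exits after 60s with
          %% {timeout,{gen_server,call,[Server,get_topology,60000]}}.
          Worker = spawn_link(fun() ->
                       gen_server:call(Server, get_topology, 60000)
                   end),
          receive
              {'EXIT', Worker, Reason} ->
                  %% service_agent logs the equivalent of this as
                  %% "Linked process ... died with reason ... Terminating".
                  io:format("linked worker died: ~p~n", [Reason])
          end.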
      

      On all the indexer nodes, errors of this type can be seen in error.log. This one is from 172.23.96.243:

      [ns_server:error,2021-02-24T16:38:43.565-08:00,ns_1@172.23.96.243:service_status_keeper-index<0.24276.0>:service_status_keeper:handle_cast:104]Service service_index returned incorrect status
      [ns_server:error,2021-02-24T16:38:44.547-08:00,ns_1@172.23.96.243:service_agent-index<0.28735.0>:service_agent:handle_info:287]Rebalancer <29344.16911.24> died unexpectedly: {agent_died,<29395.26924.0>,
                                                      {linked_process_died,
                                                       <29395.27226.0>,
                                                       {timeout,
                                                        {gen_server,call,
                                                         [<29395.27014.0>,
                                                          {call,
                                                           "ServiceAPI.GetCurrentTopology",
                                                           #Fun<json_rpc_connection.0.44122352>},
                                                          60000]}}}}
      [ns_server:error,2021-02-24T16:38:48.649-08:00,ns_1@172.23.96.243:service_status_keeper_worker<0.24275.0>:rest_utils:get_json:59]Request to (indexer) getIndexStatus failed: {ok,
                                                   {{500,"Internal Server Error"},
                                                    [{"Content-Length","226"},
                                                     {"Date",
                                                      "Thu, 25 Feb 2021 00:38:48 GMT"},
                                                     {"Content-Type",
                                                      "application/json"}],
                                                    <<"{\"code\":\"error\",\"error\":\"Fail to retrieve cluster-wide metadata from index service\",\"failedNodes\":[\"172.23.96.243:8091\",\"172.23.96.254:8091\",\"172.23.97.105:8091\",\"172.23.97.110:8091\",\"172.23.97.112:8091\",\"172.23.97.148:8091\"]}">>}}
      

      Tracing back through error.log, this error first appears at 2021-02-24T16:16:21.

      In 172.23.97.112's debug.log, we also see this:

      [error_logger:error,2021-02-24T16:38:44.544-08:00,ns_1@172.23.97.112:<0.27226.0>:ale_error_logger_handler:do_log:107]
      =========================CRASH REPORT=========================
        crasher:
          initial call: service_agent:'-start_long_poll_worker/4-fun-0-'/0
          pid: <0.27226.0>
          registered_name: []
          exception exit: {timeout,
                              {gen_server,call,
                                  [<0.27014.0>,
                                   {call,"ServiceAPI.GetCurrentTopology",
                                       #Fun<json_rpc_connection.0.44122352>},
                                   60000]}}
            in function  gen_server:call/3 (gen_server.erl, line 223)
            in call from service_api:perform_call/3 (src/service_api.erl, line 55)
            in call from service_agent:grab_topology/2 (src/service_agent.erl, line 590)
            in call from service_agent:long_poll_worker_loop/5 (src/service_agent.erl, line 655)
          ancestors: ['service_agent-index',service_agent_children_sup,
                        service_agent_sup,ns_server_sup,ns_server_nodes_sup,
                        <0.21412.0>,ns_server_cluster_sup,root_sup,<0.138.0>]
          message_queue_len: 0
          messages: []
          links: [<0.26924.0>]
          dictionary: []
          trap_exit: false
          status: running
          heap_size: 2586
          stack_size: 27
          reductions: 4854
        neighbours:
       
      [ns_server:error,2021-02-24T16:38:44.544-08:00,ns_1@172.23.97.112:service_agent-index<0.26924.0>:service_agent:handle_info:283]Linked process <0.27226.0> died with reason {timeout,
                                                   {gen_server,call,
                                                    [<0.27014.0>,
                                                     {call,
                                                      "ServiceAPI.GetCurrentTopology",
                                                      #Fun<json_rpc_connection.0.44122352>},
                                                     60000]}}. Terminating
      [ns_server:error,2021-02-24T16:38:44.545-08:00,ns_1@172.23.97.112:service_agent-index<0.26924.0>:service_agent:terminate:312]Terminating abnormally
      [ns_server:error,2021-02-24T16:38:44.545-08:00,ns_1@172.23.97.112:service_agent-index<0.26924.0>:service_agent:terminate:317]Terminating json rpc connection for index: <0.27014.0>
      [error_logger:error,2021-02-24T16:38:44.545-08:00,ns_1@172.23.97.112:json_rpc_connection_sup<0.251.0>:ale_error_logger_handler:do_log:107]
      

      Another issue is that indexer memory usage is reported at 98% on all the nodes, even though the indexes aren't even created yet. See the attached screenshot of the indexer memory usage graph. This could be the reason getIndexStatus is not responding.

      The memory_rss reported in indexer_stats.log doesn't match this:

      [root@localhost logs]# cat indexer_stats.log | grep memory_rss
      2021-02-24T16:12:13.667-08:00 indexer {"avg_resident_percent":0,"cpu_utilization":14.665200146652001,"index_not_found_errcount":0,"indexer_state":"Active","memory_free":21062361088,"memory_quota":22592618496,"memory_rss":136474624,"memory_total":24570773504,"memory_total_storage":7901184,"memory_used":112091136,"memory_used_queue":0,"memory_used_storage":167936,"needs_restart":false,"num_connections":0,"num_cpu_core":8,"num_indexes":0,"num_storage_instances":0,"storage_mode":"plasma","timestamp":"1614211933667260064","timings/stats_response":"73 2365645 262354316239","total_data_size":0,"total_disk_size":0,"uptime":"1m0.594530018s"}
      2021-02-24T16:13:13.667-08:00 indexer {"cpu_utilization":0.4,"memory_free":21221564416,"memory_rss":139624448,"memory_total_storage":8171520,"memory_used":152893440,"timestamp":"1614211993667608098","timings/stats_response":"145 4130395 307964242411","uptime":"2m0.594876694s"}
      2021-02-24T16:14:13.667-08:00 indexer {"cpu_utilization":1.7999999999999998,"memory_free":21229883392,"memory_rss":139997184,"memory_total_storage":8200192,"memory_used":162031616,"timestamp":"1614212053667910719","timings/stats_response":"217 5907270 353612419138","uptime":"3m0.595179948s"}
      2021-02-24T16:15:13.668-08:00 indexer {"cpu_utilization":16.666666666666664,"memory_free":21051469824,"memory_rss":230797312,"memory_total_storage":9506816,"memory_used":122316800,"memory_used_storage":401408,"num_connections":2,"num_indexes":1,"num_storage_instances":1,"timestamp":"1614212113668187720","timings/stats_response":"295 1646921227 2675255019827900535","total_data_size":16384,"total_disk_size":16384,"uptime":"4m0.595456286s"}
      2021-02-24T16:16:13.668-08:00 indexer {"cpu_utilization":12.333333333333334,"memory_free":20971921408,"memory_rss":232644608,"memory_total_storage":10489856,"memory_used":241833984,"memory_used_storage":585728,"timestamp":"1614212173668854454","timings/stats_response":"366 1653006709 2675255562006951163","uptime":"5m0.596126616s"}
      2021-02-24T16:17:13.669-08:00 indexer {"cpu_utilization":8.3,"memory_free":21086658560,"memory_rss":241172480,"timestamp":"1614212233669469098","timings/stats_response":"374 1653604975 2675255607031580815","uptime":"6m0.596737182s"}
      2021-02-24T16:18:13.670-08:00 indexer {"cpu_utilization":11.533333333333333,"memory_free":21028188160,"memory_rss":249155584,"timestamp":"1614212293669947037","uptime":"7m0.597216844s"}
      2021-02-24T16:19:13.670-08:00 indexer {"cpu_utilization":8.266666666666666,"memory_free":21071896576,"memory_rss":265846784,"timestamp":"1614212353670486769","uptime":"8m0.597755282s"}
      2021-02-24T16:20:13.671-08:00 indexer {"cpu_utilization":8.233333333333333,"memory_free":21050007552,"memory_rss":274522112,"timestamp":"1614212413670974938","uptime":"9m0.598243582s"}
      2021-02-24T16:21:13.671-08:00 indexer {"cpu_utilization":8.3,"memory_free":20995198976,"memory_rss":286281728,"timestamp":"1614212473671554150","uptime":"10m0.598825899s"}
      2021-02-24T16:22:13.672-08:00 indexer {"cpu_utilization":8.200000000000001,"memory_free":21029502976,"memory_rss":301539328,"timestamp":"1614212533672086444","uptime":"11m0.599355324s"}
      2021-02-24T16:23:13.672-08:00 indexer {"cpu_utilization":8.233333333333333,"memory_free":21016768512,"memory_rss":311119872,"timestamp":"1614212593672643233","uptime":"12m0.599911615s"}
      2021-02-24T16:24:13.673-08:00 indexer {"cpu_utilization":9.633333333333335,"memory_free":20899692544,"memory_rss":321163264,"timestamp":"1614212653673116844","uptime":"13m0.60038549s"}
      2021-02-24T16:25:13.673-08:00 indexer {"cpu_utilization":17.366666666666667,"memory_free":20820537344,"memory_rss":406687744,"timestamp":"1614212713673718257","uptime":"14m0.600987472s"}
      2021-02-24T16:26:13.674-08:00 indexer {"cpu_utilization":35.66666666666667,"memory_free":19433553920,"memory_rss":1713164288,"timestamp":"1614212773674250181","uptime":"15m0.601522101s"}
      2021-02-24T16:27:13.674-08:00 indexer {"cpu_utilization":37.46666666666666,"memory_free":18069827584,"memory_rss":3062398976,"timestamp":"1614212833674909242","uptime":"16m0.6021789s"}
      2021-02-24T16:28:13.675-08:00 indexer {"cpu_utilization":37.166666666666664,"memory_free":16837234688,"memory_rss":4299309056,"timestamp":"1614212893675384092","uptime":"17m0.602652978s"}
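
      For comparison, even the last sample above puts the indexer RSS nowhere near 98% of the quota. A quick back-of-the-envelope check in an Erlang shell, using memory_rss from the 16:28 sample and memory_quota / memory_total from the 16:12 sample:

      %% Erlang shell: memory_rss (16:28) vs memory_quota and memory_total (16:12).
      Rss = 4299309056, Quota = 22592618496, Total = 24570773504,
      io:format("~.1f% of quota, ~.1f% of total~n",
                [Rss / Quota * 100, Rss / Total * 100]).
      %% Prints roughly: 19.0% of quota, 17.5% of total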
      

      This issue is consistently seen in both longevity clusters running the same test with the same build. It wasn't seen earlier; the last run of this test used build 7.0.0-4525.

      Attachments

        Screenshot of the indexer memory usage graph (referenced in the description above).

      People

        amit.kulkarni Amit Kulkarni
        mihir.kamdar Mihir Kamdar (Inactive)