Couchbase Server: MB-33453

Intermittent: Rebalance operation of adding nodes into the cluster fails

    Description

      Build: 6.5.0-2686

      This is seen intermittently in the very first test of the sanity suite. The first step is to set up a cluster by adding a node to an existing one-node cluster; there are no buckets at that point. This rebalance operation sometimes fails.
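
      For reference, the failing step boils down to an add-node-plus-rebalance sequence against the ns_server REST API. Below is a minimal Go sketch of that sequence, assuming the /controller/addNode and /controller/rebalance endpoints; the first node's address, the credentials, and the otpNode names are placeholders, not values taken from this cluster.

      package main

      import (
          "fmt"
          "net/http"
          "net/url"
          "strings"
      )

      // post sends a form-encoded POST with basic auth, as the REST API expects.
      func post(client *http.Client, endpoint string, form url.Values) error {
          req, err := http.NewRequest("POST", endpoint, strings.NewReader(form.Encode()))
          if err != nil {
              return err
          }
          req.SetBasicAuth("Administrator", "password") // placeholder credentials
          req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
          resp, err := client.Do(req)
          if err != nil {
              return err
          }
          defer resp.Body.Close()
          if resp.StatusCode != http.StatusOK {
              return fmt.Errorf("%s returned %s", endpoint, resp.Status)
          }
          return nil
      }

      func main() {
          client := &http.Client{}
          // 172.23.120.172 is a placeholder for the first node; only
          // 172.23.120.173 (the node being added) appears in the excerpts.
          base := "http://172.23.120.172:8091"

          // Step 1: add the new node, which runs the index service here.
          if err := post(client, base+"/controller/addNode", url.Values{
              "hostname": {"172.23.120.173"},
              "user":     {"Administrator"},
              "password": {"password"},
              "services": {"index"},
          }); err != nil {
              panic(err)
          }

          // Step 2: rebalance; this is the operation that intermittently fails
          // with service_rebalance_failed for the index service.
          if err := post(client, base+"/controller/rebalance", url.Values{
              "knownNodes":   {"ns_1@172.23.120.172,ns_1@172.23.120.173"},
              "ejectedNodes": {""},
          }); err != nil {
              panic(err)
          }
      }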

      [ns_server:error,2019-03-20T16:02:08.194-07:00,ns_1@172.23.120.173:service_agent-index<0.2183.0>:service_agent:handle_info:231]Linked process <0.2471.0> died with reason {timeout,
                                                  {gen_server,call,
                                                   [<0.2220.0>,
                                                    {call,"ServiceAPI.GetTaskList",
                                                     #Fun<json_rpc_connection.0.102434519>},
                                                    60000]}}. Terminating
      [ns_server:error,2019-03-20T16:02:08.194-07:00,ns_1@172.23.120.173:service_agent-index<0.2183.0>:service_agent:terminate:260]Terminating abnormally
      [ns_server:error,2019-03-20T16:02:08.194-07:00,ns_1@172.23.120.173:service_agent-index<0.2183.0>:service_agent:terminate:265]Terminating json rpc connection for index: <0.2220.0>
      [ns_server:error,2019-03-20T16:02:08.195-07:00,ns_1@172.23.120.173:service_rebalancer-index<0.4013.0>:service_rebalancer:run_rebalance:82]Agent terminated during the rebalance: {'DOWN',
                                              #Ref<0.2176371542.1558970372.43169>,
                                              process,<0.2183.0>,
                                              {linked_process_died,<0.2471.0>,
                                               {timeout,
                                                {gen_server,call,
                                                 [<0.2220.0>,
                                                  {call,"ServiceAPI.GetTaskList",
                                                   #Fun<json_rpc_connection.0.102434519>},
                                                  60000]}}}}
      [ns_server:error,2019-03-20T16:02:08.195-07:00,ns_1@172.23.120.173:service_agent-index<0.5226.0>:service_agent:handle_call:182]Got rebalance-only call {if_rebalance,<0.4013.0>,unset_rebalancer} that doesn't match rebalancer pid undefined
      [ns_server:error,2019-03-20T16:02:08.196-07:00,ns_1@172.23.120.173:service_rebalancer-index<0.4013.0>:service_agent:process_bad_results:810]Service call unset_rebalancer (service index) failed on some nodes:
      [{'ns_1@172.23.120.173',nack}]
      [ns_server:warn,2019-03-20T16:02:08.196-07:00,ns_1@172.23.120.173:service_rebalancer-index<0.4013.0>:service_rebalancer:run_rebalance:91]Failed to unset rebalancer on some nodes:
      {error,{bad_nodes,index,unset_rebalancer,[{'ns_1@172.23.120.173',nack}]}}
      [ns_server:warn,2019-03-20T16:02:08.195-07:00,ns_1@172.23.120.173:<0.4034.0>:async:maybe_log_down_message:480]Monitored process <0.4027.0> terminated abnormally (reason = {linked_process_died,
                                                                    <0.2471.0>,
                                                                    {timeout,
                                                                     {gen_server,
                                                                      call,
                                                                      [<0.2220.0>,
                                                                       {call,
                                                                        "ServiceAPI.GetTaskList",
                                                                        #Fun<json_rpc_connection.0.102434519>},
                                                                       60000]}}})
      [ns_server:warn,2019-03-20T16:02:08.196-07:00,ns_1@172.23.120.173:<0.4035.0>:async:maybe_log_down_message:480]Monitored process <0.4027.0> terminated abnormally (reason = {linked_process_died,
                                                                    <0.2471.0>,
                                                                    {timeout,
                                                                     {gen_server,
                                                                      call,
                                                                      [<0.2220.0>,
                                                                       {call,
                                                                        "ServiceAPI.GetTaskList",
                                                                        #Fun<json_rpc_connection.0.102434519>},
                                                                       60000]}}})
      [ns_server:info,2019-03-20T16:02:08.196-07:00,ns_1@172.23.120.173:leader_registry<0.3547.0>:leader_registry_server:handle_down:253]Process <0.3806.0> registered as 'ns_rebalance_observer' terminated.
      [user:error,2019-03-20T16:02:08.196-07:00,ns_1@172.23.120.173:<0.3573.0>:ns_orchestrator:do_log_rebalance_completion:1243]Rebalance exited with reason {service_rebalance_failed,index,
                                    {linked_process_died,<0.2471.0>,
                                     {timeout,
                                      {gen_server,call,
                                       [<0.2220.0>,
                                        {call,"ServiceAPI.GetTaskList",
                                         #Fun<json_rpc_connection.0.102434519>},
                                        60000]}}}}. Operation Id = 1cd752172291bf432f8b439c471033c8
      

      In the index logs around the same time, we see:

      2019-03-20T16:02:06.827-07:00 [Info] PeriodicStats = {"cpu_utilization":0.33330000333300003,"index_not_found_errcount":0,"indexer_state":"Active","memory_free":43749376,"memory_quota":268435456,"memory_rss":32002048,"memory_total":3962036224,"memory_total_storage":4775936,"memory_used":60612608,"memory_used_queue":0,"memory_used_storage":163840,"needs_restart":false,"num_connections":0,"num_cpu_core":4,"storage_mode":"plasma","timings/stats_response":"70 6378342 665682935552","uptime":"1m0.06985698s"}
      2019-03-20T16:02:06.848-07:00 [Info] janitor: running cleanup.
      2019-03-20T16:02:06.848-07:00 [Info] janitor: go-routine terminates.
      2019-03-20T16:02:06.848-07:00 [Info] builder: recovering scheduled index
      2019-03-20T16:02:06.848-07:00 [Info] builder: go-routine terminates.
      2019-03-20T16:02:06.848-07:00 [Info] manager ResetConnection: closing metadata repo
      2019-03-20T16:02:06.848-07:00 [INFO][FDB] Forestdb closed database file /opt/couchbase/var/lib/couchbase/data/@2i/MetadataStore
      2019-03-20T16:02:06.848-07:00 [Error] PeerListener.listen(): Error in accepting new connection.  Error = accept tcp [::]:9100: use of closed network connection. Terminate.
      2019-03-20T16:02:06.849-07:00 [Info] memstats {"Alloc":52625032, "TotalAlloc":54099552, "Sys":60315896, "Lookups":49, "Mallocs":26069,"Frees":14271, "HeapAlloc":52625032, "HeapSys":54427648, "HeapIdle":712704, "HeapInuse":53714944,"HeapReleased":0, "HeapObjects":11798,"GCSys":1957888, "LastGC":1553122866875326778,"PauseTotalNs":17648245, "PauseNs":[2314538, 1552393, 13781314], "NumGC":3}
      2019-03-20T16:02:07.848-07:00 [Info] manager ResetConnection: creating new lifecycle mgr
      2019-03-20T16:02:07.856-07:00 [INFO][FDB] Forestdb opened database file /opt/couchbase/var/lib/couchbase/data/@2i/MetadataStore
      2019-03-20T16:02:07.856-07:00 [Info] manager ResetConnection: restarting metadadta repo
      2019-03-20T16:02:07.856-07:00 [Info] manager ResetConnection: restarting lifecycle mgr
      2019-03-20T16:02:07.856-07:00 [Info] handleSecurityChange: restarting mutation manager
      2019-03-20T16:02:07.856-07:00 [Info] handleSecurityChange: restarting scan coordinator
      2019-03-20T16:02:07.856-07:00 [Info] [Queryport ":9101"] ... stopped
      2019-03-20T16:02:07.856-07:00 [Info] [Queryport ":9101"] ... restarting listener
      2019-03-20T16:02:07.857-07:00 [Error] [Queryport ":9101"] failed starting listener :9101 listen tcp :9101: bind: address already in use !!
      2019-03-20T16:02:07.857-07:00 [Error] [Queryport ":9101"] Accept() Error: accept tcp [::]:9101: use of closed network connection
      2019-03-20T16:02:07.865-07:00 [Info] updator: updating service map.  server group=Group 1, indexerVersion=4 nodeAddr 172.23.120.173:8091 clusterVersion 4 excludeNode %!(EXTRA uint64=2)
      2019/03/20 16:02:08 revrpc: Got error (EOF) and will retry in 1s
      2019-03-20T16:02:08.252-07:00 [Info] serviceChangeNotifier: received PoolChangeNotification
      2019-03-20T16:02:08.857-07:00 [Error] HTTPS Server terminates on error: accept tcp [::]:19102: use of closed network connection
      2019-03-20T16:02:09.196-07:00 [Info] ServiceMgr::GetTaskList []
      2019-03-20T16:02:09.857-07:00 [Info] handleSecurityChange: restarting http/https server
      2019-03-20T16:02:09.857-07:00 [Info] indexer:: SSL server started: :19102
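
      The timing across the two logs looks consistent: the 60-second gen_server:call to ServiceAPI.GetTaskList expired at 16:02:08.194, so it had been pending since roughly 16:01:08; within that window the indexer restarted its listeners for a security change, the revrpc connection dropped with EOF at 16:02:08, and a GetTaskList only reached ServiceMgr at 16:02:09.196, about a second after the service agent had already been torn down. As a rough Go analogue of that bounded call (ns_server's side is Erlang over json-rpc; the request/response types and helper below are hypothetical), a dropped connection simply means the reply never arrives and the timeout branch fires:

      package main

      import (
          "fmt"
          "net/rpc"
          "net/rpc/jsonrpc"
          "time"
      )

      // Hypothetical request/response shapes for ServiceAPI.GetTaskList.
      type GetTaskListReq struct{}
      type TaskList struct {
          Tasks []string
      }

      // getTaskListWithTimeout bounds the RPC the way ns_server does with
      // gen_server:call(..., 60000): fire the call asynchronously and give
      // up if no reply arrives within the timeout.
      func getTaskListWithTimeout(client *rpc.Client, timeout time.Duration) (*TaskList, error) {
          var reply TaskList
          call := client.Go("ServiceAPI.GetTaskList", &GetTaskListReq{}, &reply, nil)
          select {
          case <-call.Done:
              return &reply, call.Error // the service answered (or the call errored)
          case <-time.After(timeout):
              // The branch the ns_server log shows: 60s pass with no reply,
              // the linked process dies, and the rebalance is aborted.
              return nil, fmt.Errorf("ServiceAPI.GetTaskList timed out after %v", timeout)
          }
      }

      func main() {
          client, err := jsonrpc.Dial("tcp", "127.0.0.1:9999") // placeholder address
          if err != nil {
              panic(err)
          }
          defer client.Close()
          if _, err := getTaskListWithTimeout(client, 60*time.Second); err != nil {
              fmt.Println("rebalance would abort:", err)
          }
      }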
      

      Logs for both nodes are attached.

      Not marking this as a test blocker, since it is not consistently reproducible.

    People

      Assignee: Mihir Kamdar (Inactive)
      Reporter: Mihir Kamdar (Inactive)
      Votes: 0
      Watchers: 2
