Couchbase Server: MB-33453

Intermittent: Rebalance operation of adding nodes into the cluster fails

    Description

      Build: 6.5.0-2686

      This is seen intermittently in the very first test of the sanity suite. The first step is to set up a cluster by adding a node to an existing one-node cluster; there are no buckets at that point. This rebalance operation sometimes fails.
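
      For reference, the failing step boils down to an add-node-plus-rebalance sequence against the ns_server REST API. Below is a minimal Go sketch of that sequence, assuming the /controller/addNode and /controller/rebalance endpoints; the first node's address, the credentials, and the otpNode names are placeholders, not values taken from this cluster.

      package main

      import (
          "fmt"
          "net/http"
          "net/url"
          "strings"
      )

      // post sends a form-encoded POST with basic auth, as the REST API expects.
      func post(client *http.Client, endpoint string, form url.Values) error {
          req, err := http.NewRequest("POST", endpoint, strings.NewReader(form.Encode()))
          if err != nil {
              return err
          }
          req.SetBasicAuth("Administrator", "password") // placeholder credentials
          req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
          resp, err := client.Do(req)
          if err != nil {
              return err
          }
          defer resp.Body.Close()
          if resp.StatusCode != http.StatusOK {
              return fmt.Errorf("%s returned %s", endpoint, resp.Status)
          }
          return nil
      }

      func main() {
          client := &http.Client{}
          // 172.23.120.172 is a placeholder for the first node; only
          // 172.23.120.173 (the node being added) appears in the excerpts.
          base := "http://172.23.120.172:8091"

          // Step 1: add the new node, which runs the index service here.
          if err := post(client, base+"/controller/addNode", url.Values{
              "hostname": {"172.23.120.173"},
              "user":     {"Administrator"},
              "password": {"password"},
              "services": {"index"},
          }); err != nil {
              panic(err)
          }

          // Step 2: rebalance; this is the operation that intermittently fails
          // with service_rebalance_failed for the index service.
          if err := post(client, base+"/controller/rebalance", url.Values{
              "knownNodes":   {"ns_1@172.23.120.172,ns_1@172.23.120.173"},
              "ejectedNodes": {""},
          }); err != nil {
              panic(err)
          }
      }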

      [ns_server:error,2019-03-20T16:02:08.194-07:00,ns_1@172.23.120.173:service_agent-index<0.2183.0>:service_agent:handle_info:231]Linked process <0.2471.0> died with reason {timeout,
                                                  {gen_server,call,
                                                   [<0.2220.0>,
                                                    {call,"ServiceAPI.GetTaskList",
                                                     #Fun<json_rpc_connection.0.102434519>},
                                                    60000]}}. Terminating
      [ns_server:error,2019-03-20T16:02:08.194-07:00,ns_1@172.23.120.173:service_agent-index<0.2183.0>:service_agent:terminate:260]Terminating abnormally
      [ns_server:error,2019-03-20T16:02:08.194-07:00,ns_1@172.23.120.173:service_agent-index<0.2183.0>:service_agent:terminate:265]Terminating json rpc connection for index: <0.2220.0>
      [ns_server:error,2019-03-20T16:02:08.195-07:00,ns_1@172.23.120.173:service_rebalancer-index<0.4013.0>:service_rebalancer:run_rebalance:82]Agent terminated during the rebalance: {'DOWN',
                                              #Ref<0.2176371542.1558970372.43169>,
                                              process,<0.2183.0>,
                                              {linked_process_died,<0.2471.0>,
                                               {timeout,
                                                {gen_server,call,
                                                 [<0.2220.0>,
                                                  {call,"ServiceAPI.GetTaskList",
                                                   #Fun<json_rpc_connection.0.102434519>},
                                                  60000]}}}}
      [ns_server:error,2019-03-20T16:02:08.195-07:00,ns_1@172.23.120.173:service_agent-index<0.5226.0>:service_agent:handle_call:182]Got rebalance-only call {if_rebalance,<0.4013.0>,unset_rebalancer} that doesn't match rebalancer pid undefined
      [ns_server:error,2019-03-20T16:02:08.196-07:00,ns_1@172.23.120.173:service_rebalancer-index<0.4013.0>:service_agent:process_bad_results:810]Service call unset_rebalancer (service index) failed on some nodes:
      [{'ns_1@172.23.120.173',nack}]
      [ns_server:warn,2019-03-20T16:02:08.196-07:00,ns_1@172.23.120.173:service_rebalancer-index<0.4013.0>:service_rebalancer:run_rebalance:91]Failed to unset rebalancer on some nodes:
      {error,{bad_nodes,index,unset_rebalancer,[{'ns_1@172.23.120.173',nack}]}}
      [ns_server:warn,2019-03-20T16:02:08.195-07:00,ns_1@172.23.120.173:<0.4034.0>:async:maybe_log_down_message:480]Monitored process <0.4027.0> terminated abnormally (reason = {linked_process_died,
                                                                    <0.2471.0>,
                                                                    {timeout,
                                                                     {gen_server,
                                                                      call,
                                                                      [<0.2220.0>,
                                                                       {call,
                                                                        "ServiceAPI.GetTaskList",
                                                                        #Fun<json_rpc_connection.0.102434519>},
                                                                       60000]}}})
      [ns_server:warn,2019-03-20T16:02:08.196-07:00,ns_1@172.23.120.173:<0.4035.0>:async:maybe_log_down_message:480]Monitored process <0.4027.0> terminated abnormally (reason = {linked_process_died,
                                                                    <0.2471.0>,
                                                                    {timeout,
                                                                     {gen_server,
                                                                      call,
                                                                      [<0.2220.0>,
                                                                       {call,
                                                                        "ServiceAPI.GetTaskList",
                                                                        #Fun<json_rpc_connection.0.102434519>},
                                                                       60000]}}})
      [ns_server:info,2019-03-20T16:02:08.196-07:00,ns_1@172.23.120.173:leader_registry<0.3547.0>:leader_registry_server:handle_down:253]Process <0.3806.0> registered as 'ns_rebalance_observer' terminated.
      [user:error,2019-03-20T16:02:08.196-07:00,ns_1@172.23.120.173:<0.3573.0>:ns_orchestrator:do_log_rebalance_completion:1243]Rebalance exited with reason {service_rebalance_failed,index,
                                    {linked_process_died,<0.2471.0>,
                                     {timeout,
                                      {gen_server,call,
                                       [<0.2220.0>,
                                        {call,"ServiceAPI.GetTaskList",
                                         #Fun<json_rpc_connection.0.102434519>},
                                        60000]}}}}. Operation Id = 1cd752172291bf432f8b439c471033c8
      

      In the index logs around the same time, we see:

      2019-03-20T16:02:06.827-07:00 [Info] PeriodicStats = {"cpu_utilization":0.33330000333300003,"index_not_found_errcount":0,"indexer_state":"Active","memory_free":43749376,"memory_quota":268435456,"memory_rss":32002048,"memory_total":3962036224,"memory_total_storage":4775936,"memory_used":60612608,"memory_used_queue":0,"memory_used_storage":163840,"needs_restart":false,"num_connections":0,"num_cpu_core":4,"storage_mode":"plasma","timings/stats_response":"70 6378342 665682935552","uptime":"1m0.06985698s"}
      2019-03-20T16:02:06.848-07:00 [Info] janitor: running cleanup.
      2019-03-20T16:02:06.848-07:00 [Info] janitor: go-routine terminates.
      2019-03-20T16:02:06.848-07:00 [Info] builder: recovering scheduled index
      2019-03-20T16:02:06.848-07:00 [Info] builder: go-routine terminates.
      2019-03-20T16:02:06.848-07:00 [Info] manager ResetConnection: closing metadata repo
      2019-03-20T16:02:06.848-07:00 [INFO][FDB] Forestdb closed database file /opt/couchbase/var/lib/couchbase/data/@2i/MetadataStore
      2019-03-20T16:02:06.848-07:00 [Error] PeerListener.listen(): Error in accepting new connection.  Error = accept tcp [::]:9100: use of closed network connection. Terminate.
      2019-03-20T16:02:06.849-07:00 [Info] memstats {"Alloc":52625032, "TotalAlloc":54099552, "Sys":60315896, "Lookups":49, "Mallocs":26069,"Frees":14271, "HeapAlloc":52625032, "HeapSys":54427648, "HeapIdle":712704, "HeapInuse":53714944,"HeapReleased":0, "HeapObjects":11798,"GCSys":1957888, "LastGC":1553122866875326778,"PauseTotalNs":17648245, "PauseNs":[2314538, 1552393, 13781314], "NumGC":3}
      2019-03-20T16:02:07.848-07:00 [Info] manager ResetConnection: creating new lifecycle mgr
      2019-03-20T16:02:07.856-07:00 [INFO][FDB] Forestdb opened database file /opt/couchbase/var/lib/couchbase/data/@2i/MetadataStore
      2019-03-20T16:02:07.856-07:00 [Info] manager ResetConnection: restarting metadadta repo
      2019-03-20T16:02:07.856-07:00 [Info] manager ResetConnection: restarting lifecycle mgr
      2019-03-20T16:02:07.856-07:00 [Info] handleSecurityChange: restarting mutation manager
      2019-03-20T16:02:07.856-07:00 [Info] handleSecurityChange: restarting scan coordinator
      2019-03-20T16:02:07.856-07:00 [Info] [Queryport ":9101"] ... stopped
      2019-03-20T16:02:07.856-07:00 [Info] [Queryport ":9101"] ... restarting listener
      2019-03-20T16:02:07.857-07:00 [Error] [Queryport ":9101"] failed starting listener :9101 listen tcp :9101: bind: address already in use !!
      2019-03-20T16:02:07.857-07:00 [Error] [Queryport ":9101"] Accept() Error: accept tcp [::]:9101: use of closed network connection
      2019-03-20T16:02:07.865-07:00 [Info] updator: updating service map.  server group=Group 1, indexerVersion=4 nodeAddr 172.23.120.173:8091 clusterVersion 4 excludeNode %!(EXTRA uint64=2)
      2019/03/20 16:02:08 revrpc: Got error (EOF) and will retry in 1s
      2019-03-20T16:02:08.252-07:00 [Info] serviceChangeNotifier: received PoolChangeNotification
      2019-03-20T16:02:08.857-07:00 [Error] HTTPS Server terminates on error: accept tcp [::]:19102: use of closed network connection
      2019-03-20T16:02:09.196-07:00 [Info] ServiceMgr::GetTaskList []
      2019-03-20T16:02:09.857-07:00 [Info] handleSecurityChange: restarting http/https server
      2019-03-20T16:02:09.857-07:00 [Info] indexer:: SSL server started: :19102
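
      The timing across the two logs looks consistent: the 60-second gen_server:call to ServiceAPI.GetTaskList expired at 16:02:08.194, so it had been pending since roughly 16:01:08; within that window the indexer restarted its listeners for a security change, the revrpc connection dropped with EOF at 16:02:08, and a GetTaskList only reached ServiceMgr at 16:02:09.196, about a second after the service agent had already been torn down. As a rough Go analogue of that bounded call (ns_server's side is Erlang over json-rpc; the request/response types and helper below are hypothetical), a dropped connection simply means the reply never arrives and the timeout branch fires:

      package main

      import (
          "fmt"
          "net/rpc"
          "net/rpc/jsonrpc"
          "time"
      )

      // Hypothetical request/response shapes for ServiceAPI.GetTaskList.
      type GetTaskListReq struct{}
      type TaskList struct {
          Tasks []string
      }

      // getTaskListWithTimeout bounds the RPC the way ns_server does with
      // gen_server:call(..., 60000): fire the call asynchronously and give
      // up if no reply arrives within the timeout.
      func getTaskListWithTimeout(client *rpc.Client, timeout time.Duration) (*TaskList, error) {
          var reply TaskList
          call := client.Go("ServiceAPI.GetTaskList", &GetTaskListReq{}, &reply, nil)
          select {
          case <-call.Done:
              return &reply, call.Error // the service answered (or the call errored)
          case <-time.After(timeout):
              // The branch the ns_server log shows: 60s pass with no reply,
              // the linked process dies, and the rebalance is aborted.
              return nil, fmt.Errorf("ServiceAPI.GetTaskList timed out after %v", timeout)
          }
      }

      func main() {
          client, err := jsonrpc.Dial("tcp", "127.0.0.1:9999") // placeholder address
          if err != nil {
              panic(err)
          }
          defer client.Close()
          if _, err := getTaskListWithTimeout(client, 60*time.Second); err != nil {
              fmt.Println("rebalance would abort:", err)
          }
      }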
      

      Logs for both nodes are attached.

      Not marking this as a test blocker, since it is not consistently reproducible.

    People

      Assignee: Mihir Kamdar (Inactive)
      Reporter: Mihir Kamdar (Inactive)
      Votes: 0
      Watchers: 2
