Details
- Bug
- Resolution: Duplicate
- Major
- 7.6.0
- Untriaged
- 0
- Unknown
Description
https://cv.jenkins.couchbase.com/job/ns-server-cluster-tests/6613/console
After calling /controller/hardResetNode for node n_11 and failing it over, we are trying to add that node back and call rebalance, but the rebalance fails.
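The scenario can be summarized as an ordered list of REST calls. The sketch below is a hypothetical helper: `Step` and `build_repro_steps` are illustrative names, not part of the cluster_tests framework. The endpoints, payloads and expected status codes are copied from the test output in this ticket, with credentials left out. It only builds the request plan and does not contact a cluster.

```python
from typing import NamedTuple, Optional, Union


class Step(NamedTuple):
    """One REST call of the scenario (hypothetical helper type)."""
    method: str
    url: str
    data: Optional[Union[str, dict]]
    expected_code: int


def build_repro_steps(
    orchestrator: str = "http://127.0.0.1:9010",
    target: str = "http://127.0.0.1:9011",
    orchestrator_otp: str = "n_10@127.0.0.1",
    target_otp: str = "n_11@127.0.0.1",
    target_tls_hostname: str = "https://127.0.0.1:19011",
) -> list:
    """Return the calls in the order they appear in the test output."""
    known_nodes = f"{orchestrator_otp},{target_otp}"
    return [
        # Set the hard reset timeout to 10 (value taken from the log).
        Step("POST", f"{target}/diag/eval",
             "ns_config:set({node, node(), {timeout, {ns_cluster, hard_reset}}}, 10)",
             200),
        # The test expects the hard reset request itself to return 500.
        Step("POST", f"{target}/controller/hardResetNode", None, 500),
        # Once the node answers again, /pools/default is expected to return 404.
        Step("GET", f"{target}/pools/default", None, 404),
        Step("POST", f"{orchestrator}/controller/startFailover",
             {"otpNode": target_otp, "allowUnsafe": "true"}, 200),
        Step("POST", f"{orchestrator}/controller/addNode",
             {"hostname": target_tls_hostname, "services": "kv,index"}, 200),
        # This is the rebalance that later reports "Rebalance failed".
        Step("POST", f"{orchestrator}/controller/rebalance",
             {"knownNodes": known_nodes, "ejectedNodes": ""}, 200),
    ]


for step in build_repro_steps():
    print(step.method, step.url, step.expected_code)
```

Every call is accepted with the expected status code, including the final rebalance request; the failure only shows up in the rebalance status that the test polls afterwards.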
13:16:42 HardResetTests.hard_reset_timeout_before_failover_add_node_test... failed [26s]
13:17:09 AssertionError: Expected final rebalance status: None
13:17:09 Found: Rebalance failed. See logs for detailed reason. You can try again.
13:17:09 ================== HardResetTests.hard_reset_timeout_before_failover_add_node_test output begin =================
13:17:09 sending POST http://127.0.0.1:9011/diag/eval {'data': 'ns_config:set({node, node(), {timeout, {ns_cluster, hard_reset}}}, 10)', 'timeout': 60} (expected code 200)
13:17:09 result: 200
13:17:09 sending POST http://127.0.0.1:9011/controller/hardResetNode {'timeout': 60} (expected code 500)
13:17:09 result: 500
13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
13:17:09 result: 200
13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657f6550>: Failed to establish a new connection: [Errno 111] Connection refused'))
13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657c8490>: Failed to establish a new connection: [Errno 111] Connection refused'))
13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657c1f50>: Failed to establish a new connection: [Errno 111] Connection refused'))
13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a65793f50>: Failed to establish a new connection: [Errno 111] Connection refused'))
13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a65b61e10>: Failed to establish a new connection: [Errno 111] Connection refused'))
13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a6585a990>: Failed to establish a new connection: [Errno 111] Connection refused'))
13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657c1650>: Failed to establish a new connection: [Errno 111] Connection refused'))
13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657c9810>: Failed to establish a new connection: [Errno 111] Connection refused'))
13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
13:17:09 result: 404
13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code 404)
13:17:09 result: 404
13:17:09 sending GET http://127.0.0.1:9010/pools/default {'timeout': 60} (expected code 200)
13:17:09 result: 200
13:17:09 sending GET http://127.0.0.1:9010/pools/default/terseClusterInfo {'timeout': 60} (expected code 200)
13:17:09 result: 200
13:17:09 sending GET http://127.0.0.1:9010/pools/nodes {'timeout': 60} (expected code 200)
13:17:09 result: 200
13:17:09 sending GET http://127.0.0.1:9011/nodes/self {'timeout': 60} (expected code 200)
13:17:09 result: 200
13:17:09 Failing over node {'user': 'Administrator', 'password': 'asdasd', 'otpNode': 'n_11@127.0.0.1', 'allowUnsafe': 'true'}
13:17:09 sending POST http://127.0.0.1:9010/controller/startFailover {'data': {'user': 'Administrator', 'password': 'asdasd', 'otpNode': 'n_11@127.0.0.1', 'allowUnsafe': 'true'}, 'timeout': 60} (expected code 200)
13:17:09 result: 200
13:17:09 Waiting up to 600s for rebalance to finish. Finished.
13:17:09 sending POST http://127.0.0.1:9010/controller/addNode {'data': {'user': 'Administrator', 'password': 'asdasd', 'hostname': 'https://127.0.0.1:19011', 'services': 'kv,index'}, 'timeout': 60} (expected code 200)
13:17:09 result: 200
13:17:09 sending GET http://127.0.0.1:9010/nodeStatuses {'timeout': 60} (expected code None)
13:17:09 result: 200
13:17:09 Starting rebalance with {'knownNodes': 'n_10@127.0.0.1,n_11@127.0.0.1', 'ejectedNodes': ''}
13:17:09 sending GET http://127.0.0.1:9010/pools/default {'timeout': 60} (expected code 200)
13:17:09 result: 200
13:17:09 sending POST http://127.0.0.1:9010/controller/rebalance {'data': {'knownNodes': 'n_10@127.0.0.1,n_11@127.0.0.1', 'ejectedNodes': ''}, 'timeout': 60} (expected code 200)
13:17:09 result: 200
13:17:09 Waiting up to 600s for rebalance to finish............. Finished.
13:17:09 =================== HardResetTests.hard_reset_timeout_before_failover_add_node_test output end ==================
13:17:09
13:17:09 Traceback with variables (most recent call last):
13:17:09 File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testlib/testlib.py", line 190, in safe_test_function_call
13:17:09 res = apply_with_seed(testset, testfunction, args, seed)
13:17:09 testset = <testsets.hard_reset_test.HardResetTests object at 0x7f9a66e041d0>
13:17:09 testfunction = 'hard_reset_timeout_before_failover_add_node_test'
13:17:09 args = []
13:17:09 testiter = 0
13:17:09 verbose = True
13:17:09 intercept_output = True
13:17:09 seed = b'\xe7\x1bp\xf7\xbf\xc0\xccMXa\x82\xae4\x8b\xbfl'
13:17:09 dry_run = False
13:17:09 res = None
13:17:09 error = None
13:17:09 testname = 'HardResetTests.hard_reset_timeout_before_failover_add_node_test'
13:17:09 report_call = <contextlib._GeneratorContextManager object at 0x7f9a670861d0>
13:17:09 e = AssertionError('Expected final rebalance status: None\nFound: Rebalance failed. See logs for detailed reason. You can try again.')
13:17:09 cscheme = <traceback_with_variables.color.ColorScheme object at 0x7f9a67ef6dd0>
13:17:09 File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testlib/testlib.py", line 203, in apply_with_seed
13:17:09 return getattr(obj, func)(*args)
13:17:09 obj = <testsets.hard_reset_test.HardResetTests object at 0x7f9a66e041d0>
13:17:09 func = 'hard_reset_timeout_before_failover_add_node_test'
13:17:09 args = []
13:17:09 seed = b'\xe7\x1bp\xf7\xbf\xc0\xccMXa\x82\xae4\x8b\xbfl'
13:17:09 rand_state = (3, (2147483648, 548738641, 2310014575, 904221336, 1184853676, 2138714585, 202092079, 1457365115, 1201876102, 75035968, 2415587986, 2614516969, 1902697633, 3356926413, 43755069, 1525913765, 62827374, 3626047610, 2491787950, 14373715, 2016590401, 372552817, 2761142435, 3782391484, 6232639, 1040326943, 4058187564, 1273738009, 1216887744, 4224800154, 3646011918, 3567010848, 1224652181, 1387199792, 3156387121, 136790963, 3875432586, 1060055346, 1632721767, 2670604253, 1950208612, 158445762, 1032531906, 1365977042, 1762153167, 1136990933, 3069287899, 3266261218, 872026493, 1110231976, 769667073, 947830109, 225201989, 1865292606, 3110648526, 2648659540, 583887651, 2494558848, 1712456811, 3580865464, 588294489, 895080431, 2106162704, 1955433875, 3523269845, 93576890, 4185691806, 3010022169, 4277507767, 1075011569, 2333541026, 1236532598, 70156441, 83392581, 2048211664, 3413351134, 2824533669, 1970315118, 2027221593, 1034245600, 1417370984, 1805806358, 2697212911, 584000181, 1590950626, 381213...
13:17:09 File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testsets/hard_reset_test.py", line 162, in hard_reset_timeout_before_failover_add_node_test
13:17:09 self.hard_reset_timeout_before_failover_testbase(self.cluster.add_node)
13:17:09 self = <testsets.hard_reset_test.HardResetTests object at 0x7f9a66e041d0>
13:17:09 File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testsets/hard_reset_test.py", line 149, in hard_reset_timeout_before_failover_testbase
13:17:09 self.cluster.rebalance(wait=True, verbose=True)
13:17:09 self = <testsets.hard_reset_test.HardResetTests object at 0x7f9a66e041d0>
13:17:09 add_node_fun = <bound method Cluster.add_node of {'_nodes': [{'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}, {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}], 'connected_nodes': [{'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}, {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}], 'first_node_index': 10, 'index': 7, 'processes': [<Popen: re...
13:17:09 node = {'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}
13:17:09 other_node = {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}
13:17:09 r = <Response [500]>
13:17:09 otp_other_node = 'n_11@127.0.0.1'
13:17:09 File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testlib/cluster.py", line 227, in rebalance
13:17:09 assert error is expected_error, \
13:17:09 self = {'_nodes': [{'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}, {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}], 'connected_nodes': [{'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}, {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}], 'first_node_index': 10, 'index': 7, 'processes': [<Popen: returncode: None args: ['erl', '+MMm...
13:17:09 ejected_nodes = None
13:17:09 wait = True
13:17:09 timeout_s = 600
13:17:09 verbose = True
13:17:09 expected_error = None
13:17:09 initial_code = 200
13:17:09 initial_expected_error = None
13:17:09 known_nodes_string = 'n_10@127.0.0.1,n_11@127.0.0.1'
13:17:09 ejected_nodes_string = ''
13:17:09 data = {'knownNodes': 'n_10@127.0.0.1,n_11@127.0.0.1', 'ejectedNodes': ''}
13:17:09 resp = <Response [200]>
13:17:09 failed_nodes = []
13:17:09 error = 'Rebalance failed. See logs for detailed reason. You can try again.'
13:17:09 failed_hostnames = []
13:17:09 otp_nodes = {'127.0.0.1:9010': 'n_10@127.0.0.1', '127.0.0.1:9011': 'n_11@127.0.0.1'}
13:17:09 builtins.AssertionError: Expected final rebalance status: None
13:17:09 Found: Rebalance failed. See logs for detailed reason. You can try again.
13:17:09 HardResetTests.hard_reset_timeout_before_failover_join_cluster_test... failed [27s]
13:17:36 AssertionError: Expected final rebalance status: None
13:17:36 Found: Rebalance failed. See logs for detailed reason. You can try again.
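Two tests fail in this console output with the same assertion. As a rough aid for triaging similar runs, the following sketch pulls the failed test names and durations out of console text. The line format is inferred only from the lines quoted in this ticket, and `failed_tests` and `CONSOLE_SAMPLE` are hypothetical names.

```python
import re

# Matches console lines such as:
# "13:16:42 HardResetTests.some_test... failed [26s]"
FAILED_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<test>\S+?)\.\.\. failed \[(?P<secs>\d+)s\]\s*$"
)


def failed_tests(console_text):
    """Return a list of (test_name, duration_seconds) for failed tests."""
    results = []
    for line in console_text.splitlines():
        match = FAILED_RE.match(line.strip())
        if match:
            results.append((match.group("test"), int(match.group("secs"))))
    return results


CONSOLE_SAMPLE = """\
13:16:42 HardResetTests.hard_reset_timeout_before_failover_add_node_test... failed [26s]
13:17:09 AssertionError: Expected final rebalance status: None
13:17:09 HardResetTests.hard_reset_timeout_before_failover_join_cluster_test... failed [27s]
"""

print(failed_tests(CONSOLE_SAMPLE))
```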
After a quick look, the rebalance appears to fail because the indexer agent crashes:
[user:error,2023-12-11T21:17:08.873Z,n_10@127.0.0.1:<0.2609.1>:ns_orchestrator:log_rebalance_completion:1661]Rebalance exited with reason {service_rebalance_failed,index,
    {agent_died,<35633.28642.0>,
        {lost_connection,{'n_11@127.0.0.1',shutdown}}}}.
Rebalance Operation Id = d1a807f735c701e671cb1af2b3cfb93c
This appears to be connected to a panic in the indexer on node n_11:
2023-12-11T21:17:06.906+00:00 [Error] RequestHandler::getIndexStatus: Error while retrieving 127.0.0.1:9168/getLocalIndexMetadata with auth Fail to unmarshal response from 127.0.0.1:9168 Error Count after pass: 1
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xe5429c]

goroutine 72 [running]:
github.com/couchbase/indexing/secondary/manager.(*LifecycleMgr).broadcastStats(0xc004a30840)
/home/couchbase/jenkins/workspace/ns-server-cluster-tests/goproj/src/github.com/couchbase/indexing/secondary/manager/lifecycle.go:3016 +0x21c
created by github.com/couchbase/indexing/secondary/manager.(*LifecycleMgr).Run in goroutine 1
/home/couchbase/jenkins/workspace/ns-server-cluster-tests/goproj/src/github.com/couchbase/indexing/secondary/manager/lifecycle.go:271 +0xe5
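The top frame of the running goroutine is the most useful key when checking for duplicates of a panic like this one. A minimal sketch for extracting it from indexer log text follows. It assumes the usual Go traceback layout (function line, then source location line) as quoted in this ticket, and `top_panic_frame` and `PANIC_SAMPLE` are hypothetical names.

```python
import re

RUNNING_RE = re.compile(r"^goroutine \d+ \[running\]:$")
LOCATION_RE = re.compile(r"^(?P<path>\S+\.go):(?P<line>\d+)")


def top_panic_frame(log_text):
    """Return (function, source_path, line_number) for the first frame of the
    running goroutine, or None if the text holds no such frame."""
    # Strip whitespace; drop blank lines and "|" separators left by page extraction.
    lines = [ln.strip() for ln in log_text.splitlines()]
    lines = [ln for ln in lines if ln and ln != "|"]
    for i, ln in enumerate(lines):
        if RUNNING_RE.match(ln) and i + 2 < len(lines):
            location = LOCATION_RE.match(lines[i + 2])
            if location is None:
                return None
            # Drop the trailing argument list, e.g. "(0xc004a30840)".
            function = re.sub(r"\([^()]*\)$", "", lines[i + 1])
            return function, location.group("path"), int(location.group("line"))
    return None


PANIC_SAMPLE = """\
panic: runtime error: invalid memory address or nil pointer dereference
|
goroutine 72 [running]:
|
github.com/couchbase/indexing/secondary/manager.(*LifecycleMgr).broadcastStats(0xc004a30840)
|
/home/couchbase/jenkins/workspace/ns-server-cluster-tests/goproj/src/github.com/couchbase/indexing/secondary/manager/lifecycle.go:3016 +0x21c
"""

print(top_panic_frame(PANIC_SAMPLE))
```

For the trace in this ticket the extracted frame is `(*LifecycleMgr).broadcastStats` at `lifecycle.go:3016`.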
Issue Links
- duplicates
- MB-60063 "panic: runtime error: invalid memory address or nil pointer dereference" observed in indexer logs
- Closed