Couchbase Server / MB-60093

[Cluster test failure] hard_reset_timeout_before_failover_add_node_test


    Description

      https://cv.jenkins.couchbase.com/job/ns-server-cluster-tests/6613/console

      After calling /controller/hardResetNode on node n_11 and failing it over, the test adds that node back and starts a rebalance, but the rebalance fails.
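      For reference, the failing sequence boils down to the following REST calls. This is a minimal sketch in Python using requests; the ports, credentials, and payloads are inferred from the log below rather than taken from the actual test code, and the /pools/default/tasks polling in the wait helper is an assumption about how the rebalance status is read:

        import time

        import requests

        BASE_A = "http://127.0.0.1:9010"    # n_10, the surviving node
        BASE_B = "http://127.0.0.1:9011"    # n_11, the node being hard-reset
        AUTH = ("Administrator", "asdasd")

        def wait_for_rebalance(base, timeout_s=600):
            # Poll the tasks endpoint until no rebalance is running; a failed
            # rebalance reports its error message here.
            deadline = time.time() + timeout_s
            while time.time() < deadline:
                tasks = requests.get(f"{base}/pools/default/tasks",
                                     auth=AUTH).json()
                reb = next(t for t in tasks if t["type"] == "rebalance")
                if reb["status"] != "running":
                    return reb
                time.sleep(1)
            raise TimeoutError("rebalance did not finish")

        # Shrink the hard-reset timeout to 10 ms so hardResetNode returns 500
        # while the reset itself still goes ahead in the background.
        requests.post(f"{BASE_B}/diag/eval", auth=AUTH,
                      data="ns_config:set({node, node(), "
                           "{timeout, {ns_cluster, hard_reset}}}, 10)")
        r = requests.post(f"{BASE_B}/controller/hardResetNode", auth=AUTH)
        assert r.status_code == 500

        # Fail n_11 over from n_10, then add it back and rebalance.
        requests.post(f"{BASE_A}/controller/startFailover", auth=AUTH,
                      data={"otpNode": "n_11@127.0.0.1",
                            "allowUnsafe": "true"})
        wait_for_rebalance(BASE_A)

        requests.post(f"{BASE_A}/controller/addNode", auth=AUTH,
                      data={"hostname": "https://127.0.0.1:19011",
                            "services": "kv,index"})
        requests.post(f"{BASE_A}/controller/rebalance", auth=AUTH,
                      data={"knownNodes": "n_10@127.0.0.1,n_11@127.0.0.1",
                            "ejectedNodes": ""})
        final = wait_for_rebalance(BASE_A)
        assert final.get("errorMessage") is None   # the check that trips here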

      13:16:42   HardResetTests.hard_reset_timeout_before_failover_add_node_test...      failed [26s]
      13:17:09     AssertionError: Expected final rebalance status: None
      13:17:09 Found: Rebalance failed. See logs for detailed reason. You can try again.
      13:17:09 ================== HardResetTests.hard_reset_timeout_before_failover_add_node_test output begin =================
      13:17:09 sending POST http://127.0.0.1:9011/diag/eval {'data': 'ns_config:set({node, node(), {timeout, {ns_cluster, hard_reset}}}, 10)', 'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 sending POST http://127.0.0.1:9011/controller/hardResetNode {'timeout': 60} (expected code 500)
      13:17:09 result: 500
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 result: 200
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657f6550>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657c8490>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657c1f50>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a65793f50>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a65b61e10>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a6585a990>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657c1650>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657c9810>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 result: 404
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code 404)
      13:17:09 result: 404
      13:17:09 sending GET http://127.0.0.1:9010/pools/default {'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 sending GET http://127.0.0.1:9010/pools/default/terseClusterInfo {'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 sending GET http://127.0.0.1:9010/pools/nodes {'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 sending GET http://127.0.0.1:9011/nodes/self {'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 Failing over node {'user': 'Administrator', 'password': 'asdasd', 'otpNode': 'n_11@127.0.0.1', 'allowUnsafe': 'true'}
      13:17:09 sending POST http://127.0.0.1:9010/controller/startFailover {'data': {'user': 'Administrator', 'password': 'asdasd', 'otpNode': 'n_11@127.0.0.1', 'allowUnsafe': 'true'}, 'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 Waiting up to 600s for rebalance to finish. Finished.
      13:17:09 sending POST http://127.0.0.1:9010/controller/addNode {'data': {'user': 'Administrator', 'password': 'asdasd', 'hostname': 'https://127.0.0.1:19011', 'services': 'kv,index'}, 'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 sending GET http://127.0.0.1:9010/nodeStatuses {'timeout': 60} (expected code None)
      13:17:09 result: 200
      13:17:09 Starting rebalance with {'knownNodes': 'n_10@127.0.0.1,n_11@127.0.0.1', 'ejectedNodes': ''}
      13:17:09 sending GET http://127.0.0.1:9010/pools/default {'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 sending POST http://127.0.0.1:9010/controller/rebalance {'data': {'knownNodes': 'n_10@127.0.0.1,n_11@127.0.0.1', 'ejectedNodes': ''}, 'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 Waiting up to 600s for rebalance to finish............. Finished.
      13:17:09 =================== HardResetTests.hard_reset_timeout_before_failover_add_node_test output end ==================
      13:17:09 
      13:17:09 Traceback with variables (most recent call last):
      13:17:09   File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testlib/testlib.py", line 190, in safe_test_function_call
      13:17:09     res = apply_with_seed(testset, testfunction, args, seed)
      13:17:09       testset = <testsets.hard_reset_test.HardResetTests object at 0x7f9a66e041d0>
      13:17:09       testfunction = 'hard_reset_timeout_before_failover_add_node_test'
      13:17:09       args = []
      13:17:09       testiter = 0
      13:17:09       verbose = True
      13:17:09       intercept_output = True
      13:17:09       seed = b'\xe7\x1bp\xf7\xbf\xc0\xccMXa\x82\xae4\x8b\xbfl'
      13:17:09       dry_run = False
      13:17:09       res = None
      13:17:09       error = None
      13:17:09       testname = 'HardResetTests.hard_reset_timeout_before_failover_add_node_test'
      13:17:09       report_call = <contextlib._GeneratorContextManager object at 0x7f9a670861d0>
      13:17:09       e = AssertionError('Expected final rebalance status: None\nFound: Rebalance failed. See logs for detailed reason. You can try again.')
      13:17:09       cscheme = <traceback_with_variables.color.ColorScheme object at 0x7f9a67ef6dd0>
      13:17:09   File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testlib/testlib.py", line 203, in apply_with_seed
      13:17:09     return getattr(obj, func)(*args)
      13:17:09       obj = <testsets.hard_reset_test.HardResetTests object at 0x7f9a66e041d0>
      13:17:09       func = 'hard_reset_timeout_before_failover_add_node_test'
      13:17:09       args = []
      13:17:09       seed = b'\xe7\x1bp\xf7\xbf\xc0\xccMXa\x82\xae4\x8b\xbfl'
      13:17:09       rand_state = (3, (2147483648, 548738641, 2310014575, 904221336, 1184853676, 2138714585, 202092079, 1457365115, 1201876102, 75035968, 2415587986, 2614516969, 1902697633, 3356926413, 43755069, 1525913765, 62827374, 3626047610, 2491787950, 14373715, 2016590401, 372552817, 2761142435, 3782391484, 6232639, 1040326943, 4058187564, 1273738009, 1216887744, 4224800154, 3646011918, 3567010848, 1224652181, 1387199792, 3156387121, 136790963, 3875432586, 1060055346, 1632721767, 2670604253, 1950208612, 158445762, 1032531906, 1365977042, 1762153167, 1136990933, 3069287899, 3266261218, 872026493, 1110231976, 769667073, 947830109, 225201989, 1865292606, 3110648526, 2648659540, 583887651, 2494558848, 1712456811, 3580865464, 588294489, 895080431, 2106162704, 1955433875, 3523269845, 93576890, 4185691806, 3010022169, 4277507767, 1075011569, 2333541026, 1236532598, 70156441, 83392581, 2048211664, 3413351134, 2824533669, 1970315118, 2027221593, 1034245600, 1417370984, 1805806358, 2697212911, 584000181, 1590950626, 381213...
      13:17:09   File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testsets/hard_reset_test.py", line 162, in hard_reset_timeout_before_failover_add_node_test
      13:17:09     self.hard_reset_timeout_before_failover_testbase(self.cluster.add_node)
      13:17:09       self = <testsets.hard_reset_test.HardResetTests object at 0x7f9a66e041d0>
      13:17:09   File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testsets/hard_reset_test.py", line 149, in hard_reset_timeout_before_failover_testbase
      13:17:09     self.cluster.rebalance(wait=True, verbose=True)
      13:17:09       self = <testsets.hard_reset_test.HardResetTests object at 0x7f9a66e041d0>
      13:17:09       add_node_fun = <bound method Cluster.add_node of {'_nodes': [{'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}, {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}], 'connected_nodes': [{'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}, {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}], 'first_node_index': 10, 'index': 7, 'processes': [<Popen: re...
      13:17:09       node = {'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}
      13:17:09       other_node = {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}
      13:17:09       r = <Response [500]>
      13:17:09       otp_other_node = 'n_11@127.0.0.1'
      13:17:09   File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testlib/cluster.py", line 227, in rebalance
      13:17:09     assert error is expected_error, \
      13:17:09       self = {'_nodes': [{'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}, {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}], 'connected_nodes': [{'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}, {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}], 'first_node_index': 10, 'index': 7, 'processes': [<Popen: returncode: None args: ['erl', '+MMm...
      13:17:09       ejected_nodes = None
      13:17:09       wait = True
      13:17:09       timeout_s = 600
      13:17:09       verbose = True
      13:17:09       expected_error = None
      13:17:09       initial_code = 200
      13:17:09       initial_expected_error = None
      13:17:09       known_nodes_string = 'n_10@127.0.0.1,n_11@127.0.0.1'
      13:17:09       ejected_nodes_string = ''
      13:17:09       data = {'knownNodes': 'n_10@127.0.0.1,n_11@127.0.0.1', 'ejectedNodes': ''}
      13:17:09       resp = <Response [200]>
      13:17:09       failed_nodes = []
      13:17:09       error = 'Rebalance failed. See logs for detailed reason. You can try again.'
      13:17:09       failed_hostnames = []
      13:17:09       otp_nodes = {'127.0.0.1:9010': 'n_10@127.0.0.1', '127.0.0.1:9011': 'n_11@127.0.0.1'}
      13:17:09 builtins.AssertionError: Expected final rebalance status: None
      13:17:09 Found: Rebalance failed. See logs for detailed reason. You can try again.
      13:17:09   HardResetTests.hard_reset_timeout_before_failover_join_cluster_test...  failed [27s]
      13:17:36     AssertionError: Expected final rebalance status: None
      13:17:36 Found: Rebalance failed. See logs for detailed reason. You can try again.
      
      

      After a quick look, it seems the rebalance fails because the indexer agent crashes:

      [user:error,2023-12-11T21:17:08.873Z,n_10@127.0.0.1:<0.2609.1>:ns_orchestrator:log_rebalance_completion:1661]Rebalance exited with reason {service_rebalance_failed,index,
                                    {agent_died,<35633.28642.0>,
                                     {lost_connection,{'n_11@127.0.0.1',shutdown}}}}.
      Rebalance Operation Id = d1a807f735c701e671cb1af2b3cfb93c
      

      The agent lost its connection because the indexer process on n_11 went down, apparently crashing with a panic:

      2023-12-11T21:17:06.906+00:00 [Error] RequestHandler::getIndexStatus: Error while retrieving 127.0.0.1:9168/getLocalIndexMetadata with auth Fail to unmarshal response from 127.0.0.1:9168 Error Count after pass: 1
      panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xe5429c]
       
      goroutine 72 [running]:
      github.com/couchbase/indexing/secondary/manager.(*LifecycleMgr).broadcastStats(0xc004a30840)
      	/home/couchbase/jenkins/workspace/ns-server-cluster-tests/goproj/src/github.com/couchbase/indexing/secondary/manager/lifecycle.go:3016 +0x21c
      created by github.com/couchbase/indexing/secondary/manager.(*LifecycleMgr).Run in goroutine 1
      	/home/couchbase/jenkins/workspace/ns-server-cluster-tests/goproj/src/github.com/couchbase/indexing/secondary/manager/lifecycle.go:271 +0xe5
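
      The shape of the crash is a shutdown race: the stats broadcaster goroutine keeps running while the state it dereferences is torn down as n_11 shuts down. For illustration only, here is a hypothetical Python analogue of that race (not the indexer code):

        import threading
        import time

        class Repo:
            def snapshot(self):
                return {}

        class LifecycleMgr:
            # Hypothetical analogue of LifecycleMgr.broadcastStats racing
            # with node shutdown.
            def __init__(self):
                self.repo = Repo()
                self.done = threading.Event()

            def broadcast_stats(self):
                # Background loop, like the goroutine created by Run above.
                while not self.done.is_set():
                    # If shutdown clears self.repo between the done-check and
                    # this call, the dereference blows up: a nil pointer panic
                    # in Go, an AttributeError on None in Python.
                    self.repo.snapshot()
                    time.sleep(0.05)

            def shutdown(self):
                self.repo = None    # state torn down before the loop stops
                self.done.set()

        mgr = LifecycleMgr()
        threading.Thread(target=mgr.broadcast_stats, daemon=True).start()
        time.sleep(0.2)
        mgr.shutdown()    # may crash the broadcaster, mirroring the panic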
      

    People

      Assignee: Dhruvil Shah (dhruvil.ketanshah)
      Reporter: Timofey Barmin (timofey.barmin)