Couchbase Server / MB-60093

[Cluster test failure] hard_reset_timeout_before_failover_add_node_test


    Description

      https://cv.jenkins.couchbase.com/job/ns-server-cluster-tests/6613/console

      After calling /controller/hardResetNode on node n_11 and failing it over, the test adds that node back and starts a rebalance, but the rebalance fails.
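      For reference, the failing sequence boils down to the following REST calls. This is a minimal sketch in Python using requests; the ports, credentials, and payloads are inferred from the log below rather than taken from the actual test code, and the /pools/default/tasks polling in the wait helper is an assumption about how the rebalance status is read:

        import time

        import requests

        BASE_A = "http://127.0.0.1:9010"    # n_10, the surviving node
        BASE_B = "http://127.0.0.1:9011"    # n_11, the node being hard-reset
        AUTH = ("Administrator", "asdasd")

        def wait_for_rebalance(base, timeout_s=600):
            # Poll the tasks endpoint until no rebalance is running; a failed
            # rebalance reports its error message here.
            deadline = time.time() + timeout_s
            while time.time() < deadline:
                tasks = requests.get(f"{base}/pools/default/tasks",
                                     auth=AUTH).json()
                reb = next(t for t in tasks if t["type"] == "rebalance")
                if reb["status"] != "running":
                    return reb
                time.sleep(1)
            raise TimeoutError("rebalance did not finish")

        # Shrink the hard-reset timeout to 10 ms so hardResetNode returns 500
        # while the reset itself still goes ahead in the background.
        requests.post(f"{BASE_B}/diag/eval", auth=AUTH,
                      data="ns_config:set({node, node(), "
                           "{timeout, {ns_cluster, hard_reset}}}, 10)")
        r = requests.post(f"{BASE_B}/controller/hardResetNode", auth=AUTH)
        assert r.status_code == 500

        # Fail n_11 over from n_10, then add it back and rebalance.
        requests.post(f"{BASE_A}/controller/startFailover", auth=AUTH,
                      data={"otpNode": "n_11@127.0.0.1",
                            "allowUnsafe": "true"})
        wait_for_rebalance(BASE_A)

        requests.post(f"{BASE_A}/controller/addNode", auth=AUTH,
                      data={"hostname": "https://127.0.0.1:19011",
                            "services": "kv,index"})
        requests.post(f"{BASE_A}/controller/rebalance", auth=AUTH,
                      data={"knownNodes": "n_10@127.0.0.1,n_11@127.0.0.1",
                            "ejectedNodes": ""})
        final = wait_for_rebalance(BASE_A)
        assert final.get("errorMessage") is None   # the check that trips here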

      13:16:42   HardResetTests.hard_reset_timeout_before_failover_add_node_test...      failed [26s]
      13:17:09     AssertionError: Expected final rebalance status: None
      13:17:09 Found: Rebalance failed. See logs for detailed reason. You can try again.
      13:17:09 ================== HardResetTests.hard_reset_timeout_before_failover_add_node_test output begin =================
      13:17:09 sending POST http://127.0.0.1:9011/diag/eval {'data': 'ns_config:set({node, node(), {timeout, {ns_cluster, hard_reset}}}, 10)', 'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 sending POST http://127.0.0.1:9011/controller/hardResetNode {'timeout': 60} (expected code 500)
      13:17:09 result: 500
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 result: 200
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657f6550>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657c8490>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657c1f50>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a65793f50>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a65b61e10>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a6585a990>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657c1650>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 got exception: HTTPConnectionPool(host='127.0.0.1', port=9011): Max retries exceeded with url: /pools/default (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9a657c9810>: Failed to establish a new connection: [Errno 111] Connection refused'))
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code None)
      13:17:09 result: 404
      13:17:09 sending GET http://127.0.0.1:9011/pools/default {'timeout': 60} (expected code 404)
      13:17:09 result: 404
      13:17:09 sending GET http://127.0.0.1:9010/pools/default {'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 sending GET http://127.0.0.1:9010/pools/default/terseClusterInfo {'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 sending GET http://127.0.0.1:9010/pools/nodes {'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 sending GET http://127.0.0.1:9011/nodes/self {'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 Failing over node {'user': 'Administrator', 'password': 'asdasd', 'otpNode': 'n_11@127.0.0.1', 'allowUnsafe': 'true'}
      13:17:09 sending POST http://127.0.0.1:9010/controller/startFailover {'data': {'user': 'Administrator', 'password': 'asdasd', 'otpNode': 'n_11@127.0.0.1', 'allowUnsafe': 'true'}, 'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 Waiting up to 600s for rebalance to finish. Finished.
      13:17:09 sending POST http://127.0.0.1:9010/controller/addNode {'data': {'user': 'Administrator', 'password': 'asdasd', 'hostname': 'https://127.0.0.1:19011', 'services': 'kv,index'}, 'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 sending GET http://127.0.0.1:9010/nodeStatuses {'timeout': 60} (expected code None)
      13:17:09 result: 200
      13:17:09 Starting rebalance with {'knownNodes': 'n_10@127.0.0.1,n_11@127.0.0.1', 'ejectedNodes': ''}
      13:17:09 sending GET http://127.0.0.1:9010/pools/default {'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 sending POST http://127.0.0.1:9010/controller/rebalance {'data': {'knownNodes': 'n_10@127.0.0.1,n_11@127.0.0.1', 'ejectedNodes': ''}, 'timeout': 60} (expected code 200)
      13:17:09 result: 200
      13:17:09 Waiting up to 600s for rebalance to finish............. Finished.
      13:17:09 =================== HardResetTests.hard_reset_timeout_before_failover_add_node_test output end ==================
      13:17:09 
      13:17:09 Traceback with variables (most recent call last):
      13:17:09   File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testlib/testlib.py", line 190, in safe_test_function_call
      13:17:09     res = apply_with_seed(testset, testfunction, args, seed)
      13:17:09       testset = <testsets.hard_reset_test.HardResetTests object at 0x7f9a66e041d0>
      13:17:09       testfunction = 'hard_reset_timeout_before_failover_add_node_test'
      13:17:09       args = []
      13:17:09       testiter = 0
      13:17:09       verbose = True
      13:17:09       intercept_output = True
      13:17:09       seed = b'\xe7\x1bp\xf7\xbf\xc0\xccMXa\x82\xae4\x8b\xbfl'
      13:17:09       dry_run = False
      13:17:09       res = None
      13:17:09       error = None
      13:17:09       testname = 'HardResetTests.hard_reset_timeout_before_failover_add_node_test'
      13:17:09       report_call = <contextlib._GeneratorContextManager object at 0x7f9a670861d0>
      13:17:09       e = AssertionError('Expected final rebalance status: None\nFound: Rebalance failed. See logs for detailed reason. You can try again.')
      13:17:09       cscheme = <traceback_with_variables.color.ColorScheme object at 0x7f9a67ef6dd0>
      13:17:09   File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testlib/testlib.py", line 203, in apply_with_seed
      13:17:09     return getattr(obj, func)(*args)
      13:17:09       obj = <testsets.hard_reset_test.HardResetTests object at 0x7f9a66e041d0>
      13:17:09       func = 'hard_reset_timeout_before_failover_add_node_test'
      13:17:09       args = []
      13:17:09       seed = b'\xe7\x1bp\xf7\xbf\xc0\xccMXa\x82\xae4\x8b\xbfl'
      13:17:09       rand_state = (3, (2147483648, 548738641, 2310014575, 904221336, 1184853676, 2138714585, 202092079, 1457365115, 1201876102, 75035968, 2415587986, 2614516969, 1902697633, 3356926413, 43755069, 1525913765, 62827374, 3626047610, 2491787950, 14373715, 2016590401, 372552817, 2761142435, 3782391484, 6232639, 1040326943, 4058187564, 1273738009, 1216887744, 4224800154, 3646011918, 3567010848, 1224652181, 1387199792, 3156387121, 136790963, 3875432586, 1060055346, 1632721767, 2670604253, 1950208612, 158445762, 1032531906, 1365977042, 1762153167, 1136990933, 3069287899, 3266261218, 872026493, 1110231976, 769667073, 947830109, 225201989, 1865292606, 3110648526, 2648659540, 583887651, 2494558848, 1712456811, 3580865464, 588294489, 895080431, 2106162704, 1955433875, 3523269845, 93576890, 4185691806, 3010022169, 4277507767, 1075011569, 2333541026, 1236532598, 70156441, 83392581, 2048211664, 3413351134, 2824533669, 1970315118, 2027221593, 1034245600, 1417370984, 1805806358, 2697212911, 584000181, 1590950626, 381213...
      13:17:09   File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testsets/hard_reset_test.py", line 162, in hard_reset_timeout_before_failover_add_node_test
      13:17:09     self.hard_reset_timeout_before_failover_testbase(self.cluster.add_node)
      13:17:09       self = <testsets.hard_reset_test.HardResetTests object at 0x7f9a66e041d0>
      13:17:09   File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testsets/hard_reset_test.py", line 149, in hard_reset_timeout_before_failover_testbase
      13:17:09     self.cluster.rebalance(wait=True, verbose=True)
      13:17:09       self = <testsets.hard_reset_test.HardResetTests object at 0x7f9a66e041d0>
      13:17:09       add_node_fun = <bound method Cluster.add_node of {'_nodes': [{'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}, {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}], 'connected_nodes': [{'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}, {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}], 'first_node_index': 10, 'index': 7, 'processes': [<Popen: re...
      13:17:09       node = {'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}
      13:17:09       other_node = {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}
      13:17:09       r = <Response [500]>
      13:17:09       otp_other_node = 'n_11@127.0.0.1'
      13:17:09   File "/home/couchbase/jenkins/workspace/ns-server-cluster-tests/ns_server/cluster_tests/testlib/cluster.py", line 227, in rebalance
      13:17:09     assert error is expected_error, \
      13:17:09       self = {'_nodes': [{'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}, {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}], 'connected_nodes': [{'url': 'http://127.0.0.1:9010', 'hostname_cached': '127.0.0.1:9010', 'host': '127.0.0.1', 'port': 9010, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19010, 'services_cached': None}, {'url': 'http://127.0.0.1:9011', 'hostname_cached': '127.0.0.1:9011', 'host': '127.0.0.1', 'port': 9011, 'auth': ('Administrator', 'asdasd'), 'data_path_cache': None, 'tls_port_cache': 19011, 'services_cached': None}], 'first_node_index': 10, 'index': 7, 'processes': [<Popen: returncode: None args: ['erl', '+MMm...
      13:17:09       ejected_nodes = None
      13:17:09       wait = True
      13:17:09       timeout_s = 600
      13:17:09       verbose = True
      13:17:09       expected_error = None
      13:17:09       initial_code = 200
      13:17:09       initial_expected_error = None
      13:17:09       known_nodes_string = 'n_10@127.0.0.1,n_11@127.0.0.1'
      13:17:09       ejected_nodes_string = ''
      13:17:09       data = {'knownNodes': 'n_10@127.0.0.1,n_11@127.0.0.1', 'ejectedNodes': ''}
      13:17:09       resp = <Response [200]>
      13:17:09       failed_nodes = []
      13:17:09       error = 'Rebalance failed. See logs for detailed reason. You can try again.'
      13:17:09       failed_hostnames = []
      13:17:09       otp_nodes = {'127.0.0.1:9010': 'n_10@127.0.0.1', '127.0.0.1:9011': 'n_11@127.0.0.1'}
      13:17:09 builtins.AssertionError: Expected final rebalance status: None
      13:17:09 Found: Rebalance failed. See logs for detailed reason. You can try again.
      13:17:09   HardResetTests.hard_reset_timeout_before_failover_join_cluster_test...  failed [27s]
      13:17:36     AssertionError: Expected final rebalance status: None
      13:17:36 Found: Rebalance failed. See logs for detailed reason. You can try again.
      
      

      After a quick look, it seems the rebalance fails because the indexer agent crashes:

      [user:error,2023-12-11T21:17:08.873Z,n_10@127.0.0.1:<0.2609.1>:ns_orchestrator:log_rebalance_completion:1661]Rebalance exited with reason {service_rebalance_failed,index,
                                    {agent_died,<35633.28642.0>,
                                     {lost_connection,{'n_11@127.0.0.1',shutdown}}}}.
      Rebalance Operation Id = d1a807f735c701e671cb1af2b3cfb93c
      

      The agent lost its connection because the indexer process on n_11 went down, apparently crashing with a panic:

      2023-12-11T21:17:06.906+00:00 [Error] RequestHandler::getIndexStatus: Error while retrieving 127.0.0.1:9168/getLocalIndexMetadata with auth Fail to unmarshal response from 127.0.0.1:9168 Error Count after pass: 1
      panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xe5429c]
       
      goroutine 72 [running]:
      github.com/couchbase/indexing/secondary/manager.(*LifecycleMgr).broadcastStats(0xc004a30840)
      	/home/couchbase/jenkins/workspace/ns-server-cluster-tests/goproj/src/github.com/couchbase/indexing/secondary/manager/lifecycle.go:3016 +0x21c
      created by github.com/couchbase/indexing/secondary/manager.(*LifecycleMgr).Run in goroutine 1
      	/home/couchbase/jenkins/workspace/ns-server-cluster-tests/goproj/src/github.com/couchbase/indexing/secondary/manager/lifecycle.go:271 +0xe5
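
      The shape of the crash is a shutdown race: the stats broadcaster goroutine keeps running while the state it dereferences is torn down as n_11 shuts down. For illustration only, here is a hypothetical Python analogue of that race (not the indexer code):

        import threading
        import time

        class Repo:
            def snapshot(self):
                return {}

        class LifecycleMgr:
            # Hypothetical analogue of LifecycleMgr.broadcastStats racing
            # with node shutdown.
            def __init__(self):
                self.repo = Repo()
                self.done = threading.Event()

            def broadcast_stats(self):
                # Background loop, like the goroutine created by Run above.
                while not self.done.is_set():
                    # If shutdown clears self.repo between the done-check and
                    # this call, the dereference blows up: a nil pointer panic
                    # in Go, an AttributeError on None in Python.
                    self.repo.snapshot()
                    time.sleep(0.05)

            def shutdown(self):
                self.repo = None    # state torn down before the loop stops
                self.done.set()

        mgr = LifecycleMgr()
        threading.Thread(target=mgr.broadcast_stats, daemon=True).start()
        time.sleep(0.2)
        mgr.shutdown()    # may crash the broadcaster, mirroring the panic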
      

    People

      Assignee: Dhruvil Shah (dhruvil.ketanshah)
      Reporter: Timofey Barmin (timofey.barmin)