Couchbase Server / MB-62166

Rebalance failed while upgrading kv node via swap rebalance


Details

    • Bug
    • Resolution: Not a Bug
    • Critical
    • 7.6.2
    • 7.6.2
    • ns_server
    • None
    • Untriaged
    • 0
    • Yes

    Description

      Build - 7.6.2-3688

      Upgrade from 7.6.0-2183 to 7.6.2-3688

      Steps to reproduce

      • Cluster: 2 kv, 2 index, and 2 n1ql nodes
      • Create a bucket and indexes
      • Upgrade the nodes in this order: index, n1ql, index, n1ql, kv, kv
      • While swap upgrading the kv node (172.23.217.218 swapped in, 172.23.106.6 ejected), the rebalance failed with the error below

        {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
        [2024-06-03 10:55:11,378] - [on_prem_rest_client:4353] INFO - Latest logs from UI on 172.23.216.187:
        [2024-06-03 10:55:11,378] - [on_prem_rest_client:4354] ERROR - {'node': 'ns_1@172.23.216.66', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1717437297681, 'shortText': 'message', 'text': "Rebalance exited with reason {buckets_cleanup_failed,['ns_1@172.23.216.70']}.\nRebalance Operation Id = ab25e14657fb25b519ca2d81c5639505", 'serverTime': '2024-06-03T13:54:57.681Z'}
        [2024-06-03 10:55:11,379] - [on_prem_rest_client:4354] ERROR - {'node': 'ns_1@172.23.216.66', 'type': 'critical', 'code': 0, 'module': 'ns_rebalancer', 'tstamp': 1717437297680, 'shortText': 'message', 'text': "Failed to cleanup old buckets on node 'ns_1@172.23.216.70': {badrpc,\n                                                             {'EXIT',timeout}}", 'serverTime': '2024-06-03T13:54:57.680Z'}
        [2024-06-03 10:55:11,379] - [on_prem_rest_client:4354] ERROR - {'node': 'ns_1@172.23.217.218', 'type': 'info', 'code': 0, 'module': 'memcached_config_mgr', 'tstamp': 1717437273851, 'shortText': 'message', 'text': 'Hot-reloaded memcached.json for config change of the following keys: [<<"scramsha_fallback_salt">>]', 'serverTime': '2024-06-03T10:54:33.851Z'}
        [2024-06-03 10:55:11,379] - [on_prem_rest_client:4354] ERROR - {'node': 'ns_1@172.23.216.66', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1717437273801, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.216.187','ns_1@172.23.216.66',\n                                 'ns_1@172.23.216.70','ns_1@172.23.216.77',\n                                 'ns_1@172.23.217.103','ns_1@172.23.217.218'], EjectNodes = ['ns_1@172.23.106.6'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = ab25e14657fb25b519ca2d81c5639505", 'serverTime': '2024-06-03T13:54:33.801Z'}
        [2024-06-03 10:55:11,380] - [on_prem_rest_client:4354] ERROR - {'node': 'ns_1@172.23.217.218', 'type': 'info', 'code': 3, 'module': 'ns_cluster', 'tstamp': 1717437273564, 'shortText': 'message', 'text': 'Node ns_1@172.23.217.218 joined cluster', 'serverTime': '2024-06-03T10:54:33.564Z'}
        [2024-06-03 10:55:11,380] - [on_prem_rest_client:4354] ERROR - {'node': 'ns_1@172.23.217.218', 'type': 'info', 'code': 1, 'module': 'menelaus_web_sup', 'tstamp': 1717437273516, 'shortText': 'web start ok', 'text': 'Couchbase Server has started on web port 8091 on node \'ns_1@172.23.217.218\'. Version: "7.6.2-3688-enterprise".', 'serverTime': '2024-06-03T10:54:33.516Z'}
        [2024-06-03 10:55:11,380] - [on_prem_rest_client:4354] ERROR - {'node': 'ns_1@172.23.216.66', 'type': 'info', 'code': 4, 'module': 'ns_node_disco', 'tstamp': 1717437269315, 'shortText': 'node up', 'text': "Node 'ns_1@172.23.216.66' saw that node 'ns_1@172.23.217.218' came up. Tags: []", 'serverTime': '2024-06-03T13:54:29.315Z'}
        [2024-06-03 10:55:11,381] - [on_prem_rest_client:4354] ERROR - {'node': 'ns_1@172.23.216.70', 'type': 'info', 'code': 4, 'module': 'ns_node_disco', 'tstamp': 1717437269313, 'shortText': 'node up', 'text': "Node 'ns_1@172.23.216.70' saw that node 'ns_1@172.23.217.218' came up. Tags: []", 'serverTime': '2024-06-03T10:54:29.313Z'}
        [2024-06-03 10:55:11,381] - [on_prem_rest_client:4354] ERROR - {'node': 'ns_1@172.23.216.77', 'type': 'info', 'code': 4, 'module': 'ns_node_disco', 'tstamp': 1717437269311, 'shortText': 'node up', 'text': "Node 'ns_1@172.23.216.77' saw that node 'ns_1@172.23.217.218' came up. Tags: []", 'serverTime': '2024-06-04T06:54:29.311Z'}
        [2024-06-03 10:55:11,381] - [on_prem_rest_client:4354] ERROR - {'node': 'ns_1@172.23.217.103', 'type': 'info', 'code': 4, 'module': 'ns_node_disco', 'tstamp': 1717437269306, 'shortText': 'node up', 'text': "Node 'ns_1@172.23.217.103' saw that node 'ns_1@172.23.217.218' came up. Tags: []", 'serverTime': '2024-06-03T12:54:29.306Z'}
        [<FrameSummary file /usr/local/lib/python3.7/threading.py, line 890 in _bootstrap>, <FrameSummary file /usr/local/lib/python3.7/threading.py, line 926 in _bootstrap_inner>, <FrameSummary file lib/tasks/taskmanager.py, line 34 in run>, <FrameSummary file lib/tasks/task.py, line 113 in step>, <FrameSummary file lib/tasks/task.py, line 910 in check>, <FrameSummary file lib/tasks/future.py, line 265 in set_exception>] 

      • In the error snippet above, the rebalance failed because the cleanup of old buckets timed out ({badrpc,{'EXIT',timeout}}) on 172.23.216.70, which is an n1ql node
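
The reading of the snippet above can be reproduced with a small script. This is a minimal sketch, not part of the test framework: `summarize_rebalance` and `ENTRIES` are hypothetical names, and the two entries are copied from the snippet and reduced to the fields used (the "Starting rebalance" text is abridged). It relies on `tstamp` being an epoch-millisecond value, since the `serverTime` strings differ from node to node.

```python
import re

# Two ns_orchestrator entries from the error snippet above, reduced to the
# fields used here. The "Starting rebalance" text is abridged; the tstamp
# values are copied unchanged.
ENTRIES = [
    {"module": "ns_orchestrator", "type": "critical", "tstamp": 1717437297681,
     "text": "Rebalance exited with reason "
             "{buckets_cleanup_failed,['ns_1@172.23.216.70']}.\n"
             "Rebalance Operation Id = ab25e14657fb25b519ca2d81c5639505"},
    {"module": "ns_orchestrator", "type": "info", "tstamp": 1717437273801,
     "text": "Starting rebalance, KeepNodes = [...]"},
]

# Matches the Erlang-style reason tuple: {reason_atom,[node, node, ...]}
EXIT_RE = re.compile(r"Rebalance exited with reason \{(\w+),\[(.*?)\]\}")


def summarize_rebalance(entries):
    """Hypothetical helper: report the exit reason, the nodes named in it,
    and the time between rebalance start and exit in milliseconds."""
    start = exit_ = None
    # Order by tstamp rather than serverTime, which varies per node.
    for entry in sorted(entries, key=lambda e: e["tstamp"]):
        if entry["module"] != "ns_orchestrator":
            continue
        if entry["text"].startswith("Starting rebalance"):
            start = entry
        elif EXIT_RE.search(entry["text"]):
            exit_ = entry
    if exit_ is None:
        return None
    reason, nodes = EXIT_RE.search(exit_["text"]).groups()
    return {
        "reason": reason,
        "failed_nodes": [n.strip(" '") for n in nodes.split(",") if n.strip()],
        "elapsed_ms": exit_["tstamp"] - start["tstamp"] if start else None,
    }


if __name__ == "__main__":
    print(summarize_rebalance(ENTRIES))
```

For these two entries the helper reports `buckets_cleanup_failed` on `ns_1@172.23.216.70`, with the exit logged 23880 ms after the rebalance started.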

      Logs - 
      test_7.zip

      Attachments

        1. screenshot-1.png
          350 kB
        2. test_7.zip
          68.25 MB

          People

            Assignee: yash.dodderi Yash Dodderi
            Reporter: yash.dodderi Yash Dodderi
            Votes: 0
            Watchers: 5
