Couchbase Server - MB-45545

[Collections] - Setting up the cluster fails with "Rebalance exited with reason {buckets_cleanup_failed,['ns_1@172.23.96.196']}" in DGM jobs


Details

    • Untriaged
    • Centos 64-bit
    • 1
    • Yes

    Description

      Scripts to Repro

      {'GROUP': 'failover_with_collection_crud_durability_MAJORITY_AND_PERSIST_TO_ACTIVE_dgm',
       'cluster_name': 'testexec.62576',
       'conf_file': 'conf/collections/collections_failover_crud_on_collections_dgm.conf',
       'ini': '/tmp/testexec.62576.ini',
       'num_nodes': 5,
       'rerun': 'False',
       'spec': 'collections_failover_crud_on_collections_dgm',
       'upgrade_version': '7.0.0-4889'}
      

      It's not one particular test that's failing. Our conf file has 6 tests; the first 2 pass fine, and from the 3rd test onwards every subsequent test fails while setting up the cluster.

      Attaching the last failure (from test_6) so that the logs from all the earlier failures can be looked at as well.

      Script

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.62576.ini GROUP=failover_with_collection_crud_durability_MAJORITY_AND_PERSIST_TO_ACTIVE_dgm,rerun=False,upgrade_version=7.0.0-4889 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_hard_failover_recovery,nodes_init=5,nodes_failover=1,recovery_type=delta,override_spec_params=durability;replicas,durability=MAJORITY_AND_PERSIST_TO_ACTIVE,step_count=1,replicas=2,bucket_spec=dgm.buckets_for_rebalance_tests_more_collections,data_load_spec=volume_test_load_with_CRUD_on_collections,data_load_stage=during,num_items=32000,dgm=60,skip_validations=False,GROUP=failover_with_collection_crud_durability_MAJORITY_AND_PERSIST_TO_ACTIVE_dgm'
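      The cluster setup that keeps failing boils down to adding the four extra nodes to 172.23.122.58 and rebalancing. A minimal sketch of just that step, assuming the stock ns_server REST endpoints, the requests library, and Administrator/password credentials (the credentials are an assumption, not taken from the job config):

      import time
      import requests

      MASTER = "172.23.122.58"
      NEW_NODES = ["172.23.97.218", "172.23.96.197", "172.23.96.220", "172.23.96.196"]
      AUTH = ("Administrator", "password")  # assumed credentials
      base = "http://%s:8091" % MASTER

      # Add the four extra nodes as KV nodes.
      for node in NEW_NODES:
          requests.post(base + "/controller/addNode", auth=AUTH,
                        data={"hostname": node, "user": AUTH[0],
                              "password": AUTH[1],
                              "services": "kv"}).raise_for_status()

      # knownNodes must be the otpNode names (ns_1@<ip>) of all cluster nodes.
      pool = requests.get(base + "/pools/default", auth=AUTH).json()
      known = ",".join(n["otpNode"] for n in pool["nodes"])

      # Start the rebalance and poll the task list until it stops running;
      # on failure the task carries the same errorMessage seen in the log below.
      requests.post(base + "/controller/rebalance", auth=AUTH,
                    data={"knownNodes": known, "ejectedNodes": ""}).raise_for_status()
      time.sleep(2)
      while True:
          tasks = requests.get(base + "/pools/default/tasks", auth=AUTH).json()
          reb = next(t for t in tasks if t["type"] == "rebalance")
          if reb.get("status") != "running":
              print(reb)
              break
          time.sleep(5)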
      

      The following rebalance fails:
      2021-04-08 01:15:23,651 | test | INFO | pool-8-thread-6 | [table_view:display:72] Rebalance Overview
      +---------------+----------+-----------------------+---------------+--------------+
      | Nodes         | Services | Version               | CPU           | Status       |
      +---------------+----------+-----------------------+---------------+--------------+
      | 172.23.122.58 | kv       | 7.0.0-4889-enterprise | 3.68129097327 | Cluster node |
      | 172.23.97.218 | None     |                       |               | <--- IN ---  |
      | 172.23.96.197 | None     |                       |               | <--- IN ---  |
      | 172.23.96.220 | None     |                       |               | <--- IN ---  |
      | 172.23.96.196 | None     |                       |               | <--- IN ---  |
      +---------------+----------+-----------------------+---------------+--------------+

      2021-04-08 01:15:28,684 | test  | ERROR   | pool-8-thread-6 | [rest_client:_rebalance_status_and_progress:1510] {u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.', u'type': u'rebalance', u'masterRequestTimedOut': False, u'statusId': u'f905a8a731f10858dc7fe0a8aecaa6c6', u'statusIsStale': False, u'lastReportURI': u'/logs/rebalanceReport?reportID=b0296a1a146d458d46f1e229d387714e', u'status': u'notRunning'} - rebalance failed
      2021-04-08 01:15:28,710 | test  | INFO    | pool-8-thread-6 | [rest_client:print_UI_logs:2611] Latest logs from UI on 172.23.122.58:
      2021-04-08 01:15:28,711 | test  | ERROR   | pool-8-thread-6 | [rest_client:print_UI_logs:2613] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.122.58', u'tstamp': 1617869723778L, u'shortText': u'message', u'serverTime': u'2021-04-08T01:15:23.778Z', u'text': u"Rebalance exited with reason {buckets_cleanup_failed,['ns_1@172.23.96.196']}.\nRebalance Operation Id = d623f07b93a35d71d94d86a93dab5dbf"}
      2021-04-08 01:15:28,711 | test  | ERROR   | pool-8-thread-6 | [rest_client:print_UI_logs:2613] {u'code': 0, u'module': u'ns_rebalancer', u'type': u'critical', u'node': u'ns_1@172.23.122.58', u'tstamp': 1617869723777L, u'shortText': u'message', u'serverTime': u'2021-04-08T01:15:23.777Z', u'text': u"Failed to cleanup old buckets on node 'ns_1@172.23.96.196': {error,ebusy}"}
      2021-04-08 01:15:28,711 | test  | ERROR   | pool-8-thread-6 | [rest_client:print_UI_logs:2613] {u'code': 0, u'module': u'ns_couchdb_api', u'type': u'critical', u'node': u'ns_1@172.23.96.196', u'tstamp': 1617869723776L, u'shortText': u'message', u'serverTime': u'2021-04-08T01:15:23.776Z', u'text': u'Unable to delete bucket database directory overlay\n{error,ebusy}'}
      2021-04-08 01:15:28,711 | test  | ERROR   | pool-8-thread-6 | [rest_client:print_UI_logs:2613] {u'code': 0, u'module': u'ns_storage_conf', u'type': u'info', u'node': u'ns_1@172.23.96.196', u'tstamp': 1617869723769L, u'shortText': u'message', u'serverTime': u'2021-04-08T01:15:23.769Z', u'text': u'Deleting old data files of bucket "overlay"'}
      2021-04-08 01:15:28,711 | test  | ERROR   | pool-8-thread-6 | [rest_client:print_UI_logs:2613] {u'code': 0, u'module': u'memcached_config_mgr', u'type': u'info', u'node': u'ns_1@172.23.96.196', u'tstamp': 1617869723710L, u'shortText': u'message', u'serverTime': u'2021-04-08T01:15:23.710Z', u'text': u'Hot-reloaded memcached.json for config change of the following keys: [<<"scramsha_fallback_salt">>]'}
      2021-04-08 01:15:28,713 | test  | ERROR   | pool-8-thread-6 | [rest_client:print_UI_logs:2613] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'info', u'node': u'ns_1@172.23.122.58', u'tstamp': 1617869723623L, u'shortText': u'message', u'serverTime': u'2021-04-08T01:15:23.623Z', u'text': u"Starting rebalance, KeepNodes = ['ns_1@172.23.96.197','ns_1@172.23.97.218',\n                                 'ns_1@172.23.96.220','ns_1@172.23.96.196',\n                                 'ns_1@172.23.122.58'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = d623f07b93a35d71d94d86a93dab5dbf"}
      2021-04-08 01:15:28,713 | test  | ERROR   | pool-8-thread-6 | [rest_client:print_UI_logs:2613] {u'code': 3, u'module': u'ns_cluster', u'type': u'info', u'node': u'ns_1@172.23.96.196', u'tstamp': 1617869723595L, u'shortText': u'message', u'serverTime': u'2021-04-08T01:15:23.595Z', u'text': u'Node ns_1@172.23.96.196 joined cluster'}
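      The chain above is: deleting the data directory of the old bucket "overlay" on 172.23.96.196 returns {error,ebusy}, ns_rebalancer reports buckets_cleanup_failed for that node, and the rebalance exits. On Linux, EBUSY on a directory removal usually means the directory is a mount point, or on some filesystems that a process still holds files under it open. A hedged diagnostic sketch to run on the failing node, assuming the default data path /opt/couchbase/var/lib/couchbase/data and that lsof is installed:

      import subprocess

      # Assumed default data path; adjust if the node uses a custom path.
      BUCKET_DIR = "/opt/couchbase/var/lib/couchbase/data/overlay"

      # 1. EBUSY is what the kernel returns when removing a mounted directory,
      #    so check whether anything is mounted at or under the bucket dir.
      with open("/proc/mounts") as f:
          mounts = [line.split()[1] for line in f]
      print("mounts under bucket dir:",
            [m for m in mounts if m.startswith(BUCKET_DIR)])

      # 2. List processes that still hold files open under it (memcached,
      #    a lingering compactor, cbcollect_info, etc.).
      out = subprocess.run(["lsof", "+D", BUCKET_DIR],
                           capture_output=True, text=True)
      print(out.stdout or "no open files reported")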
      

      This was not seen on 7.0.0-4854. cbcollect_info attached.
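      For deeper detail, the failed task output above also includes lastReportURI (/logs/rebalanceReport?reportID=b0296a1a146d458d46f1e229d387714e); fetching that report from the orchestrator gives the per-stage breakdown behind the same buckets_cleanup_failed exit. A small sketch, again with assumed Administrator/password credentials:

      import json
      import requests

      AUTH = ("Administrator", "password")  # assumed credentials
      url = ("http://172.23.122.58:8091"
             "/logs/rebalanceReport?reportID=b0296a1a146d458d46f1e229d387714e")
      print(json.dumps(requests.get(url, auth=AUTH).json(), indent=2))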

      Attachments


        Activity

          People

            Assignee: Balakumaran Gopal
            Reporter: Balakumaran Gopal
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes
