Couchbase Server / MB-55932

CDC: Delta recovery rebalance failed with 'timeout' error message

Details

    Description

      Build: 7.2.0-5242

      Steps:

      • Cluster setup

         
        +----------------+----------+-----------------+-----------+-----------+
        | Node           | Services | CPU_utilization | Mem_total | Mem_free  |
        +----------------+----------+-----------------+-----------+-----------+
        | 172.23.106.94  | kv       | 1.55735873766   | 11.74 GiB | 10.93 GiB |
        | 172.23.106.87  | kv       | 2.19224316187   | 11.74 GiB | 10.77 GiB |
        | 172.23.106.92  | kv       | 0               | 11.74 GiB | 10.84 GiB |
        | 172.23.107.147 | kv       | 0.575418225985  | 11.74 GiB | 10.84 GiB |
        +----------------+----------+-----------------+-----------+-----------+
         
        +---------+-----------+-----------------+----------+----------+-----------+------------+------------+---------------+
        | Bucket  | Type      | Storage Backend | Replicas | Items    | RAM Quota | RAM Used   | Disk Used  | ARR           |
        +---------+-----------+-----------------+----------+----------+-----------+------------+------------+---------------+
        | bucket1 | couchbase | couchstore      | 1        | 99000    | 7.81 GiB  | 122.72 MiB | 130.06 MiB | 100           |
        | bucket2 | couchbase | magma           | 1        | 49500    | 3.91 GiB  | 253.15 MiB | 197.57 MiB | 100           |
        | default | couchbase | magma           | 1        | 19496700 | 2.00 GiB  | 1.35 GiB   | 4.81 GiB   | 19.6319479707 |
        +---------+-----------+-----------------+----------+----------+-----------+------------+------------+---------------+
        

      • Load initial data and trigger disk failover (disk full) on node 172.23.106.94

        {u'code': 0, u'module': u'menelaus_web_alerts_srv', u'type': u'info', u'node': u'ns_1@172.23.106.94', u'tstamp': 1678566553989L, u'shortText': u'message',
          u'serverTime': u'2023-03-11T12:29:13.989Z', u'text': u'Approaching full disk warning. Usage of disk "/root" on node "172.23.106.94" is around 100%.'}
        

      • Recover the node, add it back with delta recovery, and trigger a rebalance
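
For reference, the add-back step can be sketched against the Couchbase REST API. This is a minimal sketch, not the calls the test framework actually makes (those are not shown in this report). It assumes the documented `POST /controller/setRecoveryType` and `POST /controller/rebalance` endpoints; the base URL, the port 8091, and the helper name `delta_recovery_requests` are illustrative assumptions, and authentication is omitted. The requests are only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def delta_recovery_requests(base_url, otp_node, known_nodes):
    """Build (but do not send) the two requests for a delta-recovery add-back.

    Hypothetical helper: the name and structure are illustrative only.
    """
    # Step 1: mark the failed-over node for delta recovery.
    set_recovery = Request(
        base_url + "/controller/setRecoveryType",
        data=urlencode({"otpNode": otp_node, "recoveryType": "delta"}).encode(),
        method="POST",
    )
    # Step 2: rebalance with every node kept and none ejected.
    rebalance = Request(
        base_url + "/controller/rebalance",
        data=urlencode(
            {"knownNodes": ",".join(known_nodes), "ejectedNodes": ""}
        ).encode(),
        method="POST",
    )
    return set_recovery, rebalance


# Node names taken from this report; base URL and port are assumptions.
nodes = [
    "ns_1@172.23.106.94",
    "ns_1@172.23.106.87",
    "ns_1@172.23.106.92",
    "ns_1@172.23.107.147",
]
set_recovery, rebalance = delta_recovery_requests(
    "http://172.23.106.87:8091", "ns_1@172.23.106.94", nodes
)
print(set_recovery.full_url, set_recovery.data.decode())
print(rebalance.full_url, rebalance.data.decode())
```

Sending the requests would additionally need administrator credentials and `urllib.request.urlopen`; both are left out so the sketch runs without a cluster.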

      Observation:

      Rebalance failed with reason `timeout`

      2023-03-11 12:35:56,904 :: Adding back node 172.23.106.94
      {u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.', u'type': u'rebalance', u'masterRequestTimedOut': False, u'statusId': u'34a518175ed901f99862a525551600b5', u'subtype': u'rebalance', u'statusIsStale': False, u'lastReportURI': u'/logs/rebalanceReport?reportID=58cf9617f036f87676781eda12cddb53', u'status': u'notRunning'} - rebalance failed
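
The status dictionary above can be checked programmatically. The following is a minimal sketch, assuming that a failed rebalance is signalled by `status` being `notRunning` together with an `errorMessage` key, as in the entry above. The report does not say which endpoint returned this dictionary, and the helper name `rebalance_failure` is hypothetical.

```python
def rebalance_failure(task):
    """Return the error message if `task` describes a failed rebalance, else None."""
    if task.get("type") != "rebalance":
        return None
    # Assumption: a stopped rebalance that carries an errorMessage has failed.
    if task.get("status") == "notRunning" and "errorMessage" in task:
        return task["errorMessage"]
    return None


# Fields copied from the status entry in this report.
task = {
    "errorMessage": "Rebalance failed. See logs for detailed reason. You can try again.",
    "type": "rebalance",
    "masterRequestTimedOut": False,
    "statusId": "34a518175ed901f99862a525551600b5",
    "subtype": "rebalance",
    "statusIsStale": False,
    "lastReportURI": "/logs/rebalanceReport?reportID=58cf9617f036f87676781eda12cddb53",
    "status": "notRunning",
}
print(rebalance_failure(task))
```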
      Latest logs from UI on 172.23.106.87 (newest first):
      {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.106.94', u'tstamp': 1678567257577L, u'shortText': u'message',
        u'serverTime': u'2023-03-11T12:40:57.577Z', u'text': u'Shutting down bucket "bucket1" on \'ns_1@172.23.106.94\' for deletion'}
      {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.106.87', u'tstamp': 1678567257572L, u'shortText': u'message',
        u'serverTime': u'2023-03-11T12:40:57.572Z', u'text': u'Rebalance exited with reason
          {prepare_delta_recovery_failed,"bucket1",{error,{failed_nodes,[{\'ns_1@172.23.106.94\',{error,timeout}}]}}}.
        Rebalance Operation Id = 0ee524e95ab99868a11b518dfc4fe7d3'}
      {u'code': 0, u'module': u'menelaus_web_alerts_srv', u'type': u'info', u'node': u'ns_1@172.23.106.94', u'tstamp': 1678566973995L, u'shortText': u'message',
        u'serverTime': u'2023-03-11T12:36:13.995Z', u'text': u'Write Commit Failure. Disk write failed for item in Bucket "bucket2" on node 172.23.106.94.'}
      {u'code': 0, u'module': u'menelaus_web_alerts_srv', u'type': u'info', u'node': u'ns_1@172.23.106.94', u'tstamp': 1678566973995L, u'shortText': u'message',
        u'serverTime': u'2023-03-11T12:36:13.995Z', u'text': u'Write Commit Failure. Disk write failed for item in Bucket "bucket1" on node 172.23.106.94.'}
      {u'code': 0, u'module': u'menelaus_web_alerts_srv', u'type': u'info', u'node': u'ns_1@172.23.106.94', u'tstamp': 1678566973995L, u'shortText': u'message',
        u'serverTime': u'2023-03-11T12:36:13.995Z', u'text': u'Write Commit Failure. Disk write failed for item in Bucket "default" on node 172.23.106.94.'}
      {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.106.94', u'tstamp': 1678566957880L, u'shortText': u'message',
        u'serverTime': u'2023-03-11T12:35:57.880Z', u'text': u'Bucket "default" loaded on node \'ns_1@172.23.106.94\' in 0 seconds.'}
      {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.106.94', u'tstamp': 1678566957602L, u'shortText': u'message',
        u'serverTime': u'2023-03-11T12:35:57.602Z', u'text': u'Bucket "bucket2" loaded on node \'ns_1@172.23.106.94\' in 0 seconds.'}
      {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.106.94', u'tstamp': 1678566957570L, u'shortText': u'message',
        u'serverTime': u'2023-03-11T12:35:57.570Z', u'text': u'Bucket "bucket1" loaded on node \'ns_1@172.23.106.94\' in 0 seconds.'}
      {u'code': 0, u'module': u'ns_orchestrator', u'type': u'info', u'node': u'ns_1@172.23.106.87', u'tstamp': 1678566957054L, u'shortText': u'message',
        u'serverTime': u'2023-03-11T12:35:57.054Z', u'text': u"Starting rebalance, KeepNodes = ['ns_1@172.23.106.94','ns_1@172.23.106.87',\n                                 'ns_1@172.23.106.92','ns_1@172.23.107.147'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.106.94'],  Delta recovery buckets = all; Operation Id = 0ee524e95ab99868a11b518dfc4fe7d3"}
      {u'code': 0, u'module': u'menelaus_web_alerts_srv', u'type': u'info', u'node': u'ns_1@172.23.106.94', u'tstamp': 1678566913995L, u'shortText': u'message',
        u'serverTime': u'2023-03-11T12:35:13.995Z', u'text': u'Write Commit Failure. Disk write failed for item in Bucket "bucket2" on node 172.23.106.94.'}
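
The log entries above are Python 2 `repr` output (`u'...'` strings, `L`-suffixed longs). As a minimal sketch, they can be read in Python 3 by stripping the `L` suffix and using `ast.literal_eval`, which then allows the failed node to be extracted and the time between rebalance start and exit to be computed. The two entries below are copied from the log, with the exit entry joined onto one line and the start entry's `text` field shortened; the helper names are hypothetical, and the `L`-stripping regex is a simplification that could also hit a digit followed by `L` inside a string.

```python
import ast
import re

# Exit entry joined onto one line; start entry's text field shortened.
RAW_ENTRIES = [
    r"""{u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.106.87', u'tstamp': 1678567257572L, u'shortText': u'message', u'serverTime': u'2023-03-11T12:40:57.572Z', u'text': u'Rebalance exited with reason {prepare_delta_recovery_failed,"bucket1",{error,{failed_nodes,[{\'ns_1@172.23.106.94\',{error,timeout}}]}}}. Rebalance Operation Id = 0ee524e95ab99868a11b518dfc4fe7d3'}""",
    r"""{u'code': 0, u'module': u'ns_orchestrator', u'type': u'info', u'node': u'ns_1@172.23.106.87', u'tstamp': 1678566957054L, u'shortText': u'message', u'serverTime': u'2023-03-11T12:35:57.054Z', u'text': u"Starting rebalance, Delta recovery nodes = ['ns_1@172.23.106.94']"}""",
]

_PY2_LONG_SUFFIX = re.compile(r"(?<=\d)L\b")


def parse_entry(raw):
    """Parse one Python 2 repr'd log dict into a Python 3 dict."""
    return ast.literal_eval(_PY2_LONG_SUFFIX.sub("", raw))


def failed_nodes(text):
    """Return (node, reason) pairs from an Erlang failed_nodes term."""
    return re.findall(r"\{'([^']+)',\{error,(\w+)\}\}", text)


# Sort oldest first; the UI lists entries newest first.
entries = sorted((parse_entry(r) for r in RAW_ENTRIES), key=lambda e: e["tstamp"])
start = next(e for e in entries if e["text"].startswith("Starting rebalance"))
exit_ = next(e for e in entries if e["text"].startswith("Rebalance exited"))

failed = failed_nodes(exit_["text"])
elapsed_s = (exit_["tstamp"] - start["tstamp"]) / 1000.0  # tstamp is in milliseconds

print(failed)
print(elapsed_s)
```

For these two entries the gap between rebalance start and exit comes out at roughly 300.5 seconds, matching the `serverTime` values 12:35:57.054 and 12:40:57.572.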

      TAF test:

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.84460.ini GROUP=disk_fo,rerun=False,disk_optimized_thread_settings=True,get-cbcollect-info=True,autoCompactionDefined=true,dedupe_update_itrs=10000,upgrade_version=7.2.0-5242 -t failover.DiskFailoverTests.DiskAutofailoverTests.test_disk_autofailover_and_addback_of_node,timeout=10,num_node_failures=1,recovery_strategy=delta,failover_action=disk_full,nodes_init=4,disk_timeout=15,bucket_spec=magma_dgm.10_percent_dgm.4_node_1_replica_magma_512,doc_size=512,randomize_value=True,data_load_spec=volume_test_load_with_CRUD_on_collections,data_location=/root,crash_warning=True,default_history_retention_for_collections=false,bucket_history_retention_seconds=86400,bucket_history_retention_bytes=20000000000,GROUP=P0_set1;disk_fo'
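
To make the test parameters easier to read, the `-t` specification can be split into a test name and key/value pairs. This is a minimal sketch that assumes the spec is a comma-separated list with the test name first; it is not how TAF itself parses the argument, and `parse_test_spec` is a hypothetical helper. All values are kept as strings.

```python
def parse_test_spec(spec):
    """Split 'test.name,key=value,...' into (name, {key: value})."""
    name, *params = spec.split(",")
    return name, dict(p.split("=", 1) for p in params)


# The -t argument copied from the command above.
SPEC = (
    "failover.DiskFailoverTests.DiskAutofailoverTests."
    "test_disk_autofailover_and_addback_of_node,timeout=10,num_node_failures=1,"
    "recovery_strategy=delta,failover_action=disk_full,nodes_init=4,disk_timeout=15,"
    "bucket_spec=magma_dgm.10_percent_dgm.4_node_1_replica_magma_512,doc_size=512,"
    "randomize_value=True,data_load_spec=volume_test_load_with_CRUD_on_collections,"
    "data_location=/root,crash_warning=True,"
    "default_history_retention_for_collections=false,"
    "bucket_history_retention_seconds=86400,"
    "bucket_history_retention_bytes=20000000000,GROUP=P0_set1;disk_fo"
)

name, params = parse_test_spec(SPEC)
print(name)
for key in ("recovery_strategy", "failover_action", "nodes_init", "disk_timeout"):
    print(key, "=", params[key])
```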
      

       

       

          People

            rohan.suri Rohan Suri
            ashwin.govindarajulu Ashwin Govindarajulu