  Couchbase Server / MB-50888

Timeout in prepare_delta_recovery_bucket


Details

    • Triaged
    • Centos 64-bit
    • 1
    • No
    • KV 2022-Feb

    Description

      [EDIT] Clone of MB-47387, focusing on a run where the timeouts were increased to 5 minutes and we still see a timeout in prepare_delta_recovery_bucket.
      Relevant comments have been copied from MB-47387 to the comments of this ticket.

      ---------------------------------------------------------------------------------------

      Noticed this failure again on the weekly run we had on 7.1.0-2223.

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.57437.ini GROUP=hard_failover_recovery_P0_set1,rerun=False,disk_optimized_thread_settings=True,get-cbcollect-info=True,autoCompactionDefined=true,upgrade_version=7.1.0-2223 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_hard_failover_recovery,doc_size=512,data_load_stage=during,step_count=1,process_concurrency=2,upgrade_version=7.1.0-2223,rerun=False,GROUP=hard_failover_recovery_P0_set1,bucket_spec=magma_dgm.80_percent_dgm.5_node_2_replica_magma_512,data_load_spec=volume_test_load_with_CRUD_on_collections,batch_size=5000,nodes_failover=1,get-cbcollect-info=True,durability=PERSIST_TO_MAJORITY,recovery_type=delta,skip_validations=False,nodes_init=5,disk_optimized_thread_settings=True,autoCompactionDefined=true,randomize_value=True,override_spec_params=durability'
      

      Steps to Repro
      1. Create a 5 node cluster.

      2022-02-05 04:39:52,098 | test  | INFO    | MainThread | [table_view:display:72] Cluster statistics
      +----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
      | Node           | Services | CPU_utilization | Mem_total | Mem_free  | Swap_mem_used       | Active / Replica  | Version               |
      +----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
      | 172.23.106.202 | kv       | 0               | 0.0 Byte  | 0.0 Byte  | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.0-2223-enterprise |
      | 172.23.106.203 | kv       | 1.11809045226   | 11.45 GiB | 10.73 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.0-2223-enterprise |
      | 172.23.106.204 | kv       | 0.401002506266  | 11.45 GiB | 10.83 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2223-enterprise |
      | 172.23.106.207 | kv       | 0.112711333751  | 11.45 GiB | 10.79 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2223-enterprise |
      | 172.23.106.206 | kv       | 0.891959798995  | 11.45 GiB | 10.68 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2223-enterprise |
      +----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
      

      2. Create buckets/scopes/collections/data.

      2022-02-05 05:11:57,193 | test  | INFO    | MainThread | [table_view:display:72] Bucket statistics
      +---------+-----------+-----------------+----------+------------+-----+---------+-----------+------------+------------+---------------+
      | Bucket  | Type      | Storage Backend | Replicas | Durability | TTL | Items   | RAM Quota | RAM Used   | Disk Used  | ARR           |
      +---------+-----------+-----------------+----------+------------+-----+---------+-----------+------------+------------+---------------+
      | bucket1 | couchbase | couchstore      | 2        | none       | 0   | 100000  | 9.77 GiB  | 320.66 MiB | 396.06 MiB | 100           |
      | bucket2 | couchbase | magma           | 2        | none       | 0   | 50000   | 4.88 GiB  | 566.53 MiB | 636.45 MiB | 100           |
      | default | couchbase | magma           | 2        | none       | 0   | 2035000 | 2.50 GiB  | 1.76 GiB   | 4.71 GiB   | 95.7214250614 |
      +---------+-----------+-----------------+----------+------------+-----+---------+-----------+------------+------------+---------------+
      

      3. Hard failover a node (172.23.106.202). Start a durable data load (PERSIST_TO_MAJORITY) in the background.

      2022-02-05 05:12:01,740 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:360] Starting rebalance operation of type : hard_failover_recovery
      2022-02-05 05:12:01,740 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:755] failing over nodes [ip:172.23.106.202 port:8091 ssh_username:root]
      2022-02-05 05:12:10,914 | test  | INFO    | pool-3-thread-20 | [rest_client:monitorRebalance:1599] Rebalance done. Taken 8.03999996185 seconds to complete
      2022-02-05 05:40:23,533 | test  | WARNING | MainThread | [rest_client:get_nodes:1885] 172.23.106.202 - Node not part of cluster inactiveFailed
      

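      The hard failover in step 3 maps onto ns_server's public REST API (POST /controller/startFailover). A minimal sketch of the request the test framework effectively issues — the target host, port 8091, and the helper name are assumptions for illustration, not taken from the framework:

```python
from urllib.parse import urlencode

def build_hard_failover_request(cluster_host, otp_node):
    """Build the URL and form body for a hard failover request
    (POST /controller/startFailover, sent to any cluster node)."""
    url = "http://%s:8091/controller/startFailover" % cluster_host
    body = urlencode({"otpNode": otp_node, "allowUnsafe": "false"})
    return url, body

# Node failed over in this run:
url, body = build_hard_failover_request("172.23.106.206", "ns_1@172.23.106.202")
```

      The real request carries basic-auth admin credentials; the test drives this through its own rest_client rather than raw HTTP.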
      4. Mark the node for delta recovery and rebalance.

      2022-02-05 05:40:28,940 | test  | INFO    | pool-3-thread-4 | [table_view:display:72] Rebalance Overview
      +----------------+----------+-----------------------+----------------+--------------+
      | Nodes          | Services | Version               | CPU            | Status       |
      +----------------+----------+-----------------------+----------------+--------------+
      | 172.23.106.202 | kv       | 7.1.0-2223-enterprise | 0.463368816531 | Cluster node |
      | 172.23.106.203 | kv       | 7.1.0-2223-enterprise | 16.0137807835  | Cluster node |
      | 172.23.106.204 | kv       | 7.1.0-2223-enterprise | 20.6513310933  | Cluster node |
      | 172.23.106.207 | kv       | 7.1.0-2223-enterprise | 16.0648207222  | Cluster node |
      | 172.23.106.206 | kv       | 7.1.0-2223-enterprise | 18.4466019417  | Cluster node |
      +----------------+----------+-----------------------+----------------+--------------+
      

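      Step 4 corresponds to two ns_server REST calls: marking the failed-over node for delta recovery (POST /controller/setRecoveryType), then starting the rebalance (POST /controller/rebalance). A hedged sketch of the payloads; the helper and host choice are illustrative, and credentials/transport are omitted:

```python
from urllib.parse import urlencode

def build_delta_recovery_requests(host, otp_node, known_nodes):
    """Build the two requests behind 'delta recovery + rebalance':
    POST /controller/setRecoveryType, then POST /controller/rebalance."""
    base = "http://%s:8091" % host
    set_recovery = (base + "/controller/setRecoveryType",
                    urlencode({"otpNode": otp_node, "recoveryType": "delta"}))
    rebalance = (base + "/controller/rebalance",
                 urlencode({"knownNodes": ",".join(known_nodes),
                            "ejectedNodes": ""}))
    return set_recovery, rebalance

# Nodes as reported in the rebalance overview above:
known = ["ns_1@172.23.106.%d" % n for n in (202, 203, 204, 206, 207)]
(recovery_url, recovery_body), (reb_url, reb_body) = \
    build_delta_recovery_requests("172.23.106.206", "ns_1@172.23.106.202", known)
```

      It is during the prepare phase of this second call that the ticket's timeout fires on the recovering node.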
      This rebalance fails, as shown in the UI logs below (orchestrator node: 172.23.106.206).

      2022-02-05 05:45:39,825 | test  | ERROR   | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.106.202', u'tstamp': 1644068736810L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:45:36.810Z', u'text': u'Shutting down bucket "bucket1" on \'ns_1@172.23.106.202\' for deletion'}
      2022-02-05 05:45:39,825 | test  | ERROR   | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.106.206', u'tstamp': 1644068736808L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:45:36.808Z', u'text': u'Rebalance exited with reason {prepare_delta_recovery_failed,"bucket2",\n                              {error,\n                               {failed_nodes,\n                                [{\'ns_1@172.23.106.202\',{error,timeout}}]}}}.\nRebalance Operation Id = 98b4679736af1d667ebfbc3990cfbac0'}
      2022-02-05 05:45:39,825 | test  | ERROR   | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.106.202', u'tstamp': 1644068663742L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:44:23.742Z', u'text': u'Bucket "bucket2" loaded on node \'ns_1@172.23.106.202\' in 35 seconds.'}
      2022-02-05 05:45:39,825 | test  | ERROR   | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.106.202', u'tstamp': 1644068431625L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:40:31.625Z', u'text': u'Bucket "bucket1" loaded on node \'ns_1@172.23.106.202\' in 0 seconds.'}
      2022-02-05 05:45:39,826 | test  | ERROR   | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'info', u'node': u'ns_1@172.23.106.206', u'tstamp': 1644068428875L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:40:28.875Z', u'text': u"Starting rebalance, KeepNodes = ['ns_1@172.23.106.202','ns_1@172.23.106.203',\n                                 'ns_1@172.23.106.204','ns_1@172.23.106.207',\n                                 'ns_1@172.23.106.206'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.106.202'],  Delta recovery buckets = all; Operation Id = 98b4679736af1d667ebfbc3990cfbac0"}
      

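      The failure reason is an Erlang term embedded in the UI log text. A small, hypothetical triage helper (not part of testrunner) that pulls the bucket name and the per-node errors out of such a message:

```python
import re

def parse_prepare_delta_recovery_failure(text):
    """Extract the bucket name and (node, error) pairs from a
    prepare_delta_recovery_failed rebalance-exit message."""
    bucket = re.search(r'prepare_delta_recovery_failed,"([^"]+)"', text)
    nodes = re.findall(r"\{'(ns_1@[^']+)',\{error,(\w+)\}\}", text)
    return (bucket.group(1) if bucket else None), nodes

# The message from the UI log above:
msg = ('Rebalance exited with reason {prepare_delta_recovery_failed,"bucket2",\n'
       '{error,{failed_nodes,'
       "[{'ns_1@172.23.106.202',{error,timeout}}]}}}.")
bucket, failed = parse_prepare_delta_recovery_failure(msg)
```

      Here it yields bucket "bucket2" with a single failed node, 172.23.106.202, reporting a timeout.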
      Supportal
      http://supportal.couchbase.com/snapshot/4cf367937edc82805458c0fa58ea460d::0

      s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.202.zip
      s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.203.zip
      s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.204.zip
      s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.206.zip
      s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.207.zip


    People

      Assignee: Daniel Owen (owend)
      Reporter: Daniel Owen (owend)
      Votes: 0
      Watchers: 4


                Gerrit Reviews

                  There are no open Gerrit changes
