  Couchbase Server / MB-50888

Timeout in prepare_delta_recovery_bucket


Details

    • Triaged
    • Centos 64-bit
    • 1
    • No
    • KV 2022-Feb

    Description

      [EDIT] Clone of MB-47387, focusing on a run where the timeouts were increased to 5 minutes and we still see a timeout in prepare_delta_recovery_bucket.
      Relevant comments have been copied from MB-47387 to the comments of this ticket.

      ---------------------------------------------------------------------------------------

      Noticed this failure again on the weekly run we had on 7.1.0-2223.

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.57437.ini GROUP=hard_failover_recovery_P0_set1,rerun=False,disk_optimized_thread_settings=True,get-cbcollect-info=True,autoCompactionDefined=true,upgrade_version=7.1.0-2223 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_hard_failover_recovery,doc_size=512,data_load_stage=during,step_count=1,process_concurrency=2,upgrade_version=7.1.0-2223,rerun=False,GROUP=hard_failover_recovery_P0_set1,bucket_spec=magma_dgm.80_percent_dgm.5_node_2_replica_magma_512,data_load_spec=volume_test_load_with_CRUD_on_collections,batch_size=5000,nodes_failover=1,get-cbcollect-info=True,durability=PERSIST_TO_MAJORITY,recovery_type=delta,skip_validations=False,nodes_init=5,disk_optimized_thread_settings=True,autoCompactionDefined=true,randomize_value=True,override_spec_params=durability'
      

      Steps to Repro
      1. Create a 5 node cluster.

      2022-02-05 04:39:52,098 | test  | INFO    | MainThread | [table_view:display:72] Cluster statistics
      +----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
      | Node           | Services | CPU_utilization | Mem_total | Mem_free  | Swap_mem_used       | Active / Replica  | Version               |
      +----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
      | 172.23.106.202 | kv       | 0               | 0.0 Byte  | 0.0 Byte  | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.0-2223-enterprise |
      | 172.23.106.203 | kv       | 1.11809045226   | 11.45 GiB | 10.73 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.0-2223-enterprise |
      | 172.23.106.204 | kv       | 0.401002506266  | 11.45 GiB | 10.83 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2223-enterprise |
      | 172.23.106.207 | kv       | 0.112711333751  | 11.45 GiB | 10.79 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2223-enterprise |
      | 172.23.106.206 | kv       | 0.891959798995  | 11.45 GiB | 10.68 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2223-enterprise |
      +----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
      

      2. Create buckets/scopes/collections/data.

      2022-02-05 05:11:57,193 | test  | INFO    | MainThread | [table_view:display:72] Bucket statistics
      +---------+-----------+-----------------+----------+------------+-----+---------+-----------+------------+------------+---------------+
      | Bucket  | Type      | Storage Backend | Replicas | Durability | TTL | Items   | RAM Quota | RAM Used   | Disk Used  | ARR           |
      +---------+-----------+-----------------+----------+------------+-----+---------+-----------+------------+------------+---------------+
      | bucket1 | couchbase | couchstore      | 2        | none       | 0   | 100000  | 9.77 GiB  | 320.66 MiB | 396.06 MiB | 100           |
      | bucket2 | couchbase | magma           | 2        | none       | 0   | 50000   | 4.88 GiB  | 566.53 MiB | 636.45 MiB | 100           |
      | default | couchbase | magma           | 2        | none       | 0   | 2035000 | 2.50 GiB  | 1.76 GiB   | 4.71 GiB   | 95.7214250614 |
      +---------+-----------+-----------------+----------+------------+-----+---------+-----------+------------+------------+---------------+
      

      3. Hard failover a node (172.23.106.202). Start a durable data load (PERSIST_TO_MAJORITY) in the background.

      2022-02-05 05:12:01,740 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:360] Starting rebalance operation of type : hard_failover_recovery
      2022-02-05 05:12:01,740 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:755] failing over nodes [ip:172.23.106.202 port:8091 ssh_username:root]
      2022-02-05 05:12:10,914 | test  | INFO    | pool-3-thread-20 | [rest_client:monitorRebalance:1599] Rebalance done. Taken 8.03999996185 seconds to complete
      2022-02-05 05:40:23,533 | test  | WARNING | MainThread | [rest_client:get_nodes:1885] 172.23.106.202 - Node not part of cluster inactiveFailed
      

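      The hard failover in step 3 maps onto ns_server's public REST API (POST /controller/startFailover). A minimal sketch of the request the test framework effectively issues — the target host, port 8091, and the helper name are assumptions for illustration, not taken from the framework:

```python
from urllib.parse import urlencode

def build_hard_failover_request(cluster_host, otp_node):
    """Build the URL and form body for a hard failover request
    (POST /controller/startFailover, sent to any cluster node)."""
    url = "http://%s:8091/controller/startFailover" % cluster_host
    body = urlencode({"otpNode": otp_node, "allowUnsafe": "false"})
    return url, body

# Node failed over in this run:
url, body = build_hard_failover_request("172.23.106.206", "ns_1@172.23.106.202")
```

      The real request carries basic-auth admin credentials; the test drives this through its own rest_client rather than raw HTTP.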
      4. Mark the node for delta recovery and rebalance.

      2022-02-05 05:40:28,940 | test  | INFO    | pool-3-thread-4 | [table_view:display:72] Rebalance Overview
      +----------------+----------+-----------------------+----------------+--------------+
      | Nodes          | Services | Version               | CPU            | Status       |
      +----------------+----------+-----------------------+----------------+--------------+
      | 172.23.106.202 | kv       | 7.1.0-2223-enterprise | 0.463368816531 | Cluster node |
      | 172.23.106.203 | kv       | 7.1.0-2223-enterprise | 16.0137807835  | Cluster node |
      | 172.23.106.204 | kv       | 7.1.0-2223-enterprise | 20.6513310933  | Cluster node |
      | 172.23.106.207 | kv       | 7.1.0-2223-enterprise | 16.0648207222  | Cluster node |
      | 172.23.106.206 | kv       | 7.1.0-2223-enterprise | 18.4466019417  | Cluster node |
      +----------------+----------+-----------------------+----------------+--------------+
      

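      Step 4 corresponds to two ns_server REST calls: marking the failed-over node for delta recovery (POST /controller/setRecoveryType), then starting the rebalance (POST /controller/rebalance). A hedged sketch of the payloads; the helper and host choice are illustrative, and credentials/transport are omitted:

```python
from urllib.parse import urlencode

def build_delta_recovery_requests(host, otp_node, known_nodes):
    """Build the two requests behind 'delta recovery + rebalance':
    POST /controller/setRecoveryType, then POST /controller/rebalance."""
    base = "http://%s:8091" % host
    set_recovery = (base + "/controller/setRecoveryType",
                    urlencode({"otpNode": otp_node, "recoveryType": "delta"}))
    rebalance = (base + "/controller/rebalance",
                 urlencode({"knownNodes": ",".join(known_nodes),
                            "ejectedNodes": ""}))
    return set_recovery, rebalance

# Nodes as reported in the rebalance overview above:
known = ["ns_1@172.23.106.%d" % n for n in (202, 203, 204, 206, 207)]
(recovery_url, recovery_body), (reb_url, reb_body) = \
    build_delta_recovery_requests("172.23.106.206", "ns_1@172.23.106.202", known)
```

      It is during the prepare phase of this second call that the ticket's timeout fires on the recovering node.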
      This rebalance fails, as shown in the UI logs below (orchestrator node: 172.23.106.206).

      2022-02-05 05:45:39,825 | test  | ERROR   | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.106.202', u'tstamp': 1644068736810L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:45:36.810Z', u'text': u'Shutting down bucket "bucket1" on \'ns_1@172.23.106.202\' for deletion'}
      2022-02-05 05:45:39,825 | test  | ERROR   | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.106.206', u'tstamp': 1644068736808L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:45:36.808Z', u'text': u'Rebalance exited with reason {prepare_delta_recovery_failed,"bucket2",\n                              {error,\n                               {failed_nodes,\n                                [{\'ns_1@172.23.106.202\',{error,timeout}}]}}}.\nRebalance Operation Id = 98b4679736af1d667ebfbc3990cfbac0'}
      2022-02-05 05:45:39,825 | test  | ERROR   | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.106.202', u'tstamp': 1644068663742L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:44:23.742Z', u'text': u'Bucket "bucket2" loaded on node \'ns_1@172.23.106.202\' in 35 seconds.'}
      2022-02-05 05:45:39,825 | test  | ERROR   | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.106.202', u'tstamp': 1644068431625L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:40:31.625Z', u'text': u'Bucket "bucket1" loaded on node \'ns_1@172.23.106.202\' in 0 seconds.'}
      2022-02-05 05:45:39,826 | test  | ERROR   | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'info', u'node': u'ns_1@172.23.106.206', u'tstamp': 1644068428875L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:40:28.875Z', u'text': u"Starting rebalance, KeepNodes = ['ns_1@172.23.106.202','ns_1@172.23.106.203',\n                                 'ns_1@172.23.106.204','ns_1@172.23.106.207',\n                                 'ns_1@172.23.106.206'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.106.202'],  Delta recovery buckets = all; Operation Id = 98b4679736af1d667ebfbc3990cfbac0"}
      

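      The failure reason is an Erlang term embedded in the UI log text. A small, hypothetical triage helper (not part of testrunner) that pulls the bucket name and the per-node errors out of such a message:

```python
import re

def parse_prepare_delta_recovery_failure(text):
    """Extract the bucket name and (node, error) pairs from a
    prepare_delta_recovery_failed rebalance-exit message."""
    bucket = re.search(r'prepare_delta_recovery_failed,"([^"]+)"', text)
    nodes = re.findall(r"\{'(ns_1@[^']+)',\{error,(\w+)\}\}", text)
    return (bucket.group(1) if bucket else None), nodes

# The message from the UI log above:
msg = ('Rebalance exited with reason {prepare_delta_recovery_failed,"bucket2",\n'
       '{error,{failed_nodes,'
       "[{'ns_1@172.23.106.202',{error,timeout}}]}}}.")
bucket, failed = parse_prepare_delta_recovery_failure(msg)
```

      Here it yields bucket "bucket2" with a single failed node, 172.23.106.202, reporting a timeout.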
      Supportal
      http://supportal.couchbase.com/snapshot/4cf367937edc82805458c0fa58ea460d::0

      s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.202.zip
      s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.203.zip
      s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.204.zip
      s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.206.zip
      s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.207.zip


    People

      Assignee: Daniel Owen (owend)
      Reporter: Daniel Owen (owend)
      Votes: 0
      Watchers: 4


                Gerrit Reviews

                  There are no open Gerrit changes
