Details
-
Bug
-
Resolution: Fixed
-
Critical
-
7.1.0
-
7.1.0-1066-enterprise
-
Triaged
-
Centos 64-bit
-
1
-
No
-
KV 2022-Feb
Description
[EDIT] Clone of MB-47387 focusing on run where timeouts have been increased to 5 minutes and we see a timeout in prepare_delta_recovery_bucket.
Relevant comments have been copied from MB-47387 to the comments of this ticket.
---------------------------------------------------------------------------------------
Noticed this failure again on the weekly run we had on 7.1.0-2223.
Script to Repro
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.57437.ini GROUP=hard_failover_recovery_P0_set1,rerun=False,disk_optimized_thread_settings=True,get-cbcollect-info=True,autoCompactionDefined=true,upgrade_version=7.1.0-2223 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_hard_failover_recovery,doc_size=512,data_load_stage=during,step_count=1,process_concurrency=2,upgrade_version=7.1.0-2223,rerun=False,GROUP=hard_failover_recovery_P0_set1,bucket_spec=magma_dgm.80_percent_dgm.5_node_2_replica_magma_512,data_load_spec=volume_test_load_with_CRUD_on_collections,batch_size=5000,nodes_failover=1,get-cbcollect-info=True,durability=PERSIST_TO_MAJORITY,recovery_type=delta,skip_validations=False,nodes_init=5,disk_optimized_thread_settings=True,autoCompactionDefined=true,randomize_value=True,override_spec_params=durability'
|
Steps to Repro
1. Create a 5 node cluster.
2022-02-05 04:39:52,098 | test | INFO | MainThread | [table_view:display:72] Cluster statistics
|
+----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
|
| Node | Services | CPU_utilization | Mem_total | Mem_free | Swap_mem_used | Active / Replica | Version |
|
+----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
|
| 172.23.106.202 | kv | 0 | 0.0 Byte | 0.0 Byte | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.0-2223-enterprise |
|
| 172.23.106.203 | kv | 1.11809045226 | 11.45 GiB | 10.73 GiB | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.0-2223-enterprise |
|
| 172.23.106.204 | kv | 0.401002506266 | 11.45 GiB | 10.83 GiB | 0.0 Byte / 3.50 GiB | 0 / 0 | 7.1.0-2223-enterprise |
|
| 172.23.106.207 | kv | 0.112711333751 | 11.45 GiB | 10.79 GiB | 0.0 Byte / 3.50 GiB | 0 / 0 | 7.1.0-2223-enterprise |
|
| 172.23.106.206 | kv | 0.891959798995 | 11.45 GiB | 10.68 GiB | 0.0 Byte / 3.50 GiB | 0 / 0 | 7.1.0-2223-enterprise |
|
+----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
|
2. Create buckets/scopes/collections/data.
2022-02-05 05:11:57,193 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
|
+---------+-----------+-----------------+----------+------------+-----+---------+-----------+------------+------------+---------------+
|
| Bucket | Type | Storage Backend | Replicas | Durability | TTL | Items | RAM Quota | RAM Used | Disk Used | ARR |
|
+---------+-----------+-----------------+----------+------------+-----+---------+-----------+------------+------------+---------------+
|
| bucket1 | couchbase | couchstore | 2 | none | 0 | 100000 | 9.77 GiB | 320.66 MiB | 396.06 MiB | 100 |
|
| bucket2 | couchbase | magma | 2 | none | 0 | 50000 | 4.88 GiB | 566.53 MiB | 636.45 MiB | 100 |
|
| default | couchbase | magma | 2 | none | 0 | 2035000 | 2.50 GiB | 1.76 GiB | 4.71 GiB | 95.7214250614 |
|
+---------+-----------+-----------------+----------+------------+-----+---------+-----------+------------+------------+---------------+
|
3. Hard failover a node (172.23.106.202). Start durability data load in background(PERSIST_TO_MAJORITY).
2022-02-05 05:12:01,740 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:360] Starting rebalance operation of type : hard_failover_recovery
|
2022-02-05 05:12:01,740 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:755] failing over nodes [ip:172.23.106.202 port:8091 ssh_username:root]
|
2022-02-05 05:12:10,914 | test | INFO | pool-3-thread-20 | [rest_client:monitorRebalance:1599] Rebalance done. Taken 8.03999996185 seconds to complete
|
2022-02-05 05:40:23,533 | test | WARNING | MainThread | [rest_client:get_nodes:1885] 172.23.106.202 - Node not part of cluster inactiveFailed
|
4. Do a delta recovery of the node and do a rebalance.
2022-02-05 05:40:28,940 | test | INFO | pool-3-thread-4 | [table_view:display:72] Rebalance Overview
|
+----------------+----------+-----------------------+----------------+--------------+
|
| Nodes | Services | Version | CPU | Status |
|
+----------------+----------+-----------------------+----------------+--------------+
|
| 172.23.106.202 | kv | 7.1.0-2223-enterprise | 0.463368816531 | Cluster node |
|
| 172.23.106.203 | kv | 7.1.0-2223-enterprise | 16.0137807835 | Cluster node |
|
| 172.23.106.204 | kv | 7.1.0-2223-enterprise | 20.6513310933 | Cluster node |
|
| 172.23.106.207 | kv | 7.1.0-2223-enterprise | 16.0648207222 | Cluster node |
|
| 172.23.106.206 | kv | 7.1.0-2223-enterprise | 18.4466019417 | Cluster node |
|
+----------------+----------+-----------------------+----------------+--------------+
|
This rebalance fails as shown below.
172.23.106.206
2022-02-05 05:45:39,825 | test | ERROR | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.106.202', u'tstamp': 1644068736810L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:45:36.810Z', u'text': u'Shutting down bucket "bucket1" on \'ns_1@172.23.106.202\' for deletion'}
|
2022-02-05 05:45:39,825 | test | ERROR | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.106.206', u'tstamp': 1644068736808L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:45:36.808Z', u'text': u'Rebalance exited with reason {prepare_delta_recovery_failed,"bucket2",\n {error,\n {failed_nodes,\n [{\'ns_1@172.23.106.202\',{error,timeout}}]}}}.\nRebalance Operation Id = 98b4679736af1d667ebfbc3990cfbac0'}
|
2022-02-05 05:45:39,825 | test | ERROR | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.106.202', u'tstamp': 1644068663742L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:44:23.742Z', u'text': u'Bucket "bucket2" loaded on node \'ns_1@172.23.106.202\' in 35 seconds.'}
|
2022-02-05 05:45:39,825 | test | ERROR | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.106.202', u'tstamp': 1644068431625L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:40:31.625Z', u'text': u'Bucket "bucket1" loaded on node \'ns_1@172.23.106.202\' in 0 seconds.'}
|
2022-02-05 05:45:39,826 | test | ERROR | pool-3-thread-4 | [rest_client:print_UI_logs:2831] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'info', u'node': u'ns_1@172.23.106.206', u'tstamp': 1644068428875L, u'shortText': u'message', u'serverTime': u'2022-02-05T05:40:28.875Z', u'text': u"Starting rebalance, KeepNodes = ['ns_1@172.23.106.202','ns_1@172.23.106.203',\n 'ns_1@172.23.106.204','ns_1@172.23.106.207',\n 'ns_1@172.23.106.206'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.106.202'], Delta recovery buckets = all; Operation Id = 98b4679736af1d667ebfbc3990cfbac0"}
|
Supportal
http://supportal.couchbase.com/snapshot/4cf367937edc82805458c0fa58ea460d::0
s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.202.zip
s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.203.zip
s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.204.zip
s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.206.zip
s3://cb-customers-secure/mb-47387_2223/2022-02-07/collectinfo-2022-02-05t134623-ns_1@172.23.106.207.zip
Attachments
Issue Links
- Clones
-
MB-47387 [Magma] - Opening bucket during recovery takes a long time
- Closed