Details
-
Bug
-
Resolution: Incomplete
-
Major
-
None
-
6.6.1
-
None
-
Enterprise Edition 6.6.1 build 9123 ‧ IPv4 © 2020 Couchbase, Inc.
-
Untriaged
-
Centos 64-bit
-
1
-
No
Description
Script to Repro
./testrunner -i /tmp/win10-bucket-ops.ini rerun=False -t volumetests.test_system_orchestrator_heartbeats_and_timeouts.volume.test_volume_MB_41562,nodes_init=6,initial_load=500000,replicas=2
Steps to Repro
1. Set non-default orchestrator heartbeats and timeouts.
curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({mb_master, heartbeat_interval}, 500).'
curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({mb_master, timeout_interval_count}, 3).'
curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({leader_lease_acquire_worker, lease_time}, 5000).'
curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({leader_lease_acquire_worker, lease_grace_time}, 2000).'
curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({leader_lease_acquire_worker, lease_renew_after}, 500).'
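For scripted repro runs, the five settings above could be generated from one table rather than typed by hand. The following is a minimal sketch, not part of the test suite: the helper names `ns_config_set_expr` and `build_request` are hypothetical, and the keys, values, URL, and credentials are simply copied from the curl commands in this step. It only builds the requests; sending them is left to the caller.

```python
import base64
import urllib.request

# (config key, value) pairs copied from the curl commands in step 1.
SETTINGS = [
    (("mb_master", "heartbeat_interval"), 500),
    (("mb_master", "timeout_interval_count"), 3),
    (("leader_lease_acquire_worker", "lease_time"), 5000),
    (("leader_lease_acquire_worker", "lease_grace_time"), 2000),
    (("leader_lease_acquire_worker", "lease_renew_after"), 500),
]


def ns_config_set_expr(key, value):
    """Render the Erlang expression that is posted to /diag/eval."""
    module, param = key
    return "ns_config:set({%s, %s}, %d)." % (module, param, value)


def build_request(base_url, user, password, key, value):
    """Build (but do not send) the POST equivalent of one curl call."""
    body = ns_config_set_expr(key, value).encode("utf-8")
    credentials = ("%s:%s" % (user, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(
        base_url + "/diag/eval", data=body, method="POST"
    )
    request.add_header("Authorization", "Basic " + token)
    return request


if __name__ == "__main__":
    # Print the expressions so they can be compared with the curl commands.
    for key, value in SETTINGS:
        print(ns_config_set_expr(key, value))
```

Passing a built request to `urllib.request.urlopen` would apply the setting on a live node.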
2. Create a 6 node cluster and 4 buckets.
------------------------------------
Nodes | Services | Status |
------------------------------------
172.23.105.175 | kv | Cluster node |
172.23.106.250 | index | Cluster node |
172.23.106.236 | kv | Cluster node |
172.23.106.251 | n1ql | Cluster node |
172.23.106.233 | kv | Cluster node |
172.23.106.238 | kv | Cluster node |
3. Create data on 4 buckets.
2020-10-12 08:57:09,473 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
------------------------------------------------------------------------
Bucket | Type | Replicas | Durability | TTL | Items | RAM Quota | RAM Used | Disk Used |
------------------------------------------------------------------------
bucket1 | membase | 2 | none | 0 | 500000 | 19960692736 | 393151040 | 314841262 |
bucket2 | membase | 2 | none | 0 | 500000 | 19960692736 | 386728464 | 315357753 |
bucket3 | membase | 2 | none | 0 | 500000 | 19960692736 | 401526144 | 278419032 |
bucket4 | membase | 2 | none | 0 | 500000 | 19960692736 | 427735520 | 777525742 |
------------------------------------------------------------------------
4. Start data load and do a rebalance in of node 172.23.120.87
2020-10-12 08:57:12,467 | test | INFO | pool-1-thread-3 | [table_view:display:72] Rebalance Overview
------------------------------------
Nodes | Services | Status |
------------------------------------
172.23.105.175 | kv | Cluster node |
172.23.106.250 | index | Cluster node |
172.23.106.236 | kv | Cluster node |
172.23.106.251 | n1ql | Cluster node |
172.23.106.233 | kv | Cluster node |
172.23.106.238 | kv | Cluster node |
172.23.120.87 | None | <--- IN --- |
------------------------------------
5. Find the orchestrator node (172.23.105.175), kill the babysitter on it, hard failover the node, start couchbase-server, then do a delta recovery and rebalance.
2020-10-12 09:01:16,887 | test | INFO | MainThread | [test_system_orchestrator_heartbeats_and_timeouts:kill_babysittter_process:252] Killing babysitter on : 172.23.105.175
2020-10-12 09:01:18,410 | test | INFO | MainThread | [test_system_orchestrator_heartbeats_and_timeouts:kill_babysitter_orchestrator_hard_failover_recovery_and_rebalance:523] Number of failed over nodes : 1
2020-10-12 09:01:18,950 | test | INFO | MainThread | [test_system_orchestrator_heartbeats_and_timeouts:start_couchbase_server:259] Starting couchbase server on : 172.23.105.175
2020-10-12 09:13:19,786 | test | INFO | MainThread | [test_system_orchestrator_heartbeats_and_timeouts:kill_babysitter_orchestrator_hard_failover_recovery_and_rebalance:537] Doing a delta recovery
2020-10-12 09:13:50,394 | test | INFO | pool-1-thread-2 | [table_view:display:72] Rebalance Overview
+----------------+----------+--------------+
| Nodes          | Services | Status       |
+----------------+----------+--------------+
| 172.23.105.175 | kv       | Cluster node |
| 172.23.106.250 | index    | Cluster node |
| 172.23.106.236 | kv       | Cluster node |
| 172.23.106.251 | n1ql     | Cluster node |
| 172.23.106.233 | kv       | Cluster node |
| 172.23.106.238 | kv       | Cluster node |
| 172.23.120.87  | kv       | Cluster node |
+----------------+----------+--------------+
2020-10-12 09:13:55,414 | test | INFO | pool-1-thread-2 | [task:check:304] Rebalance - status: none, progress: 100
2020-10-12 09:13:55,446 | test | INFO | pool-1-thread-2 | [task:check:363] Rebalance completed with progress: 100% in 5.05200004578 sec
2020-10-12 09:18:00,467 | test | INFO | MainThread | [rest_client:rebalance_reached:105] Rebalance reached >100% in 245.019999981 seconds
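Because this report concerns timeouts, the gaps between log lines may be worth checking. The sketch below is illustrative only: it assumes the log format shown in this step (a leading `YYYY-MM-DD HH:MM:SS,mmm` timestamp followed by ` | `), and the helper names `parse_log_time` and `elapsed_seconds` are hypothetical. Applied to the "Starting couchbase server" and "Doing a delta recovery" lines above, it gives a gap of roughly 12 minutes.

```python
from datetime import datetime

# Two log lines copied verbatim from step 5.
START_SERVER = (
    "2020-10-12 09:01:18,950 | test | INFO | MainThread | "
    "[test_system_orchestrator_heartbeats_and_timeouts:start_couchbase_server:259] "
    "Starting couchbase server on : 172.23.105.175"
)
DELTA_RECOVERY = (
    "2020-10-12 09:13:19,786 | test | INFO | MainThread | "
    "[test_system_orchestrator_heartbeats_and_timeouts:"
    "kill_babysitter_orchestrator_hard_failover_recovery_and_rebalance:537] "
    "Doing a delta recovery"
)


def parse_log_time(line):
    """Parse the timestamp that precedes the first ' | ' separator."""
    stamp = line.split(" | ", 1)[0].strip()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


def elapsed_seconds(first_line, second_line):
    """Return the number of seconds between two log lines."""
    delta = parse_log_time(second_line) - parse_log_time(first_line)
    return delta.total_seconds()


if __name__ == "__main__":
    print(elapsed_seconds(START_SERVER, DELTA_RECOVERY))
```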
6. Repeat step 5; this time the orchestrator is 172.23.106.233.
This time the rebalance fails, as shown below:
Rebalance exited with reason {prepare_rebalance_failed,
                              {error,
                               {failed_nodes,
                                [{'ns_1@172.23.106.233',{error,timeout}}]}}}.
Rebalance Operation Id = b2c5b4fea1b0b74c08a0789f1f73d073
I have run this test about 10 times; every time, the rebalance failed on the second pass through the babysitter-kill, failover, and delta-recovery step.
cbcollect_info logs are attached.
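When comparing repeated runs, it may help to pull the failing node and error out of the rebalance failure reason automatically. The following is a rough sketch under stated assumptions: it treats the reason as plain text and matches entries shaped like the one reported above with a regular expression; it is not an Erlang term parser, and the name `failed_nodes` is hypothetical.

```python
import re

# Matches entries shaped like {'ns_1@172.23.106.233',{error,timeout}}.
FAILED_NODE_RE = re.compile(r"\{'([^']+)',\s*\{error,\s*(\w+)\}\}")

# Failure reason copied from this report.
REASON = """Rebalance exited with reason {prepare_rebalance_failed,
{error,
{failed_nodes,
[{'ns_1@172.23.106.233',{error,timeout}}]}}}."""


def failed_nodes(reason_text):
    """Return a list of (node, error) pairs found in the reason text."""
    return FAILED_NODE_RE.findall(reason_text)


if __name__ == "__main__":
    for node, error in failed_nodes(REASON):
        print(node, error)
```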