Details

Type: Bug
Resolution: Incomplete
Priority: Major
Fix Version/s: None
Affects Version/s: 6.6.1
Component/s: qe
Labels:
None
Environment:
Enterprise Edition 6.6.1 build 9123 ‧ IPv4 © 2020 Couchbase, Inc.

Triage:
Untriaged
Operating System:
Centos 64-bit
Story Points:
1
Is this a Regression?:
No

Description

Script to Repro

./testrunner -i /tmp/win10-bucket-ops.ini rerun=False -t volumetests.test_system_orchestrator_heartbeats_and_timeouts.volume.test_volume_MB_41562,nodes_init=6,initial_load=500000,replicas=2

Steps to Repro
1. Set non-default orchestrator heartbeats and timeouts.

curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({mb_master, heartbeat_interval}, 500).'

curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({mb_master, timeout_interval_count}, 3).'

curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({leader_lease_acquire_worker, lease_time}, 5000).'

curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({leader_lease_acquire_worker, lease_grace_time}, 2000).'

curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({leader_lease_acquire_worker, lease_renew_after}, 500).'
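The five settings above can also be applied from a script. The following is a minimal Python sketch, not part of the original test: the helper names `ns_config_set_expr`, `build_request`, and `apply_settings` are hypothetical, and the host, port, and credentials are simply the ones shown in the curl commands.

```python
import base64
import urllib.request

# (section, key, value) triples copied from the curl commands in step 1.
SETTINGS = [
    ("mb_master", "heartbeat_interval", 500),
    ("mb_master", "timeout_interval_count", 3),
    ("leader_lease_acquire_worker", "lease_time", 5000),
    ("leader_lease_acquire_worker", "lease_grace_time", 2000),
    ("leader_lease_acquire_worker", "lease_renew_after", 500),
]


def ns_config_set_expr(section, key, value):
    """Build the Erlang expression that is posted to /diag/eval."""
    return "ns_config:set({%s, %s}, %d)." % (section, key, value)


def build_request(expr, host="localhost", port=9000,
                  user="Administrator", password="asdasd"):
    """Build a POST request equivalent to `curl -u user:password -d expr`."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"http://{host}:{port}/diag/eval",
        data=expr.encode(),
        headers={"Authorization": "Basic " + token},
        method="POST",
    )


def apply_settings(settings=SETTINGS, **connection):
    """Send every setting to the node; needs a reachable cluster."""
    for section, key, value in settings:
        request = build_request(ns_config_set_expr(section, key, value),
                                **connection)
        with urllib.request.urlopen(request, timeout=10) as response:
            response.read()
```

Calling `apply_settings()` sends the same five expressions as the curl commands; `apply_settings(host="172.23.105.175", port=8091)` would be the form for a non-local node, where the address and port are only an illustration.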

2. Create a 6 node cluster and 4 buckets.
+----------------+----------+--------------+
| Nodes          | Services | Status       |
+----------------+----------+--------------+
| 172.23.105.175 | kv       | Cluster node |
| 172.23.106.250 | index    | Cluster node |
| 172.23.106.236 | kv       | Cluster node |
| 172.23.106.251 | n1ql     | Cluster node |
| 172.23.106.233 | kv       | Cluster node |
| 172.23.106.238 | kv       | Cluster node |
+----------------+----------+--------------+

3. Create data on 4 buckets.
2020-10-12 08:57:09,473 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
+---------+---------+----------+------------+-----+--------+-------------+-----------+-----------+
| Bucket  | Type    | Replicas | Durability | TTL | Items  | RAM Quota   | RAM Used  | Disk Used |
+---------+---------+----------+------------+-----+--------+-------------+-----------+-----------+
| bucket1 | membase | 2        | none       |     | 500000 | 19960692736 | 393151040 | 314841262 |
| bucket2 | membase | 2        | none       |     | 500000 | 19960692736 | 386728464 | 315357753 |
| bucket3 | membase | 2        | none       |     | 500000 | 19960692736 | 401526144 | 278419032 |
| bucket4 | membase | 2        | none       |     | 500000 | 19960692736 | 427735520 | 777525742 |
+---------+---------+----------+------------+-----+--------+-------------+-----------+-----------+

4. Start the data load and rebalance in node 172.23.120.87.
2020-10-12 08:57:12,467 | test | INFO | pool-1-thread-3 | [table_view:display:72] Rebalance Overview
+----------------+----------+--------------+
| Nodes          | Services | Status       |
+----------------+----------+--------------+
| 172.23.105.175 | kv       | Cluster node |
| 172.23.106.250 | index    | Cluster node |
| 172.23.106.236 | kv       | Cluster node |
| 172.23.106.251 | n1ql     | Cluster node |
| 172.23.106.233 | kv       | Cluster node |
| 172.23.106.238 | kv       | Cluster node |
| 172.23.120.87  | None     | <--- IN ---  |
+----------------+----------+--------------+

5. Find the orchestrator node (172.23.105.175), kill the babysitter on it, do a hard failover, start couchbase-server, then start a delta recovery and rebalance.

2020-10-12 09:01:16,887 | test  | INFO    | MainThread | [test_system_orchestrator_heartbeats_and_timeouts:kill_babysittter_process:252] Killing babysitter on : 172.23.105.175

2020-10-12 09:01:18,410 | test  | INFO    | MainThread | [test_system_orchestrator_heartbeats_and_timeouts:kill_babysitter_orchestrator_hard_failover_recovery_and_rebalance:523] Number of failed over nodes : 1

2020-10-12 09:01:18,950 | test  | INFO    | MainThread | [test_system_orchestrator_heartbeats_and_timeouts:start_couchbase_server:259] Starting couchbase server on : 172.23.105.175

2020-10-12 09:13:19,786 | test  | INFO    | MainThread | [test_system_orchestrator_heartbeats_and_timeouts:kill_babysitter_orchestrator_hard_failover_recovery_and_rebalance:537] Doing a delta recovery

2020-10-12 09:13:50,394 | test  | INFO    | pool-1-thread-2 | [table_view:display:72] Rebalance Overview

+----------------+----------+--------------+
| Nodes          | Services | Status       |
+----------------+----------+--------------+
| 172.23.105.175 | kv       | Cluster node |
| 172.23.106.250 | index    | Cluster node |
| 172.23.106.236 | kv       | Cluster node |
| 172.23.106.251 | n1ql     | Cluster node |
| 172.23.106.233 | kv       | Cluster node |
| 172.23.106.238 | kv       | Cluster node |
| 172.23.120.87  | kv       | Cluster node |
+----------------+----------+--------------+

2020-10-12 09:13:55,414 | test  | INFO    | pool-1-thread-2 | [task:check:304] Rebalance - status: none, progress: 100

2020-10-12 09:13:55,446 | test  | INFO    | pool-1-thread-2 | [task:check:363] Rebalance completed with progress: 100% in 5.05200004578 sec

2020-10-12 09:18:00,467 | test  | INFO    | MainThread | [rest_client:rebalance_reached:105] Rebalance reached >100% in 245.019999981 seconds
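The timestamps in the log lines above can be used to measure how long each phase took. This is a minimal Python sketch, assuming the log format shown in this report (a timestamp followed by ` | `-separated fields); `parse_log_time` and `elapsed_seconds` are hypothetical helper names, not testrunner functions.

```python
from datetime import datetime

# Timestamp layout used by the test log, e.g. "2020-10-12 09:01:18,950".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Two lines copied from the log excerpt in this report.
START_SERVER = (
    "2020-10-12 09:01:18,950 | test  | INFO    | MainThread | "
    "[test_system_orchestrator_heartbeats_and_timeouts:"
    "start_couchbase_server:259] "
    "Starting couchbase server on : 172.23.105.175"
)
DELTA_RECOVERY = (
    "2020-10-12 09:13:19,786 | test  | INFO    | MainThread | "
    "[test_system_orchestrator_heartbeats_and_timeouts:"
    "kill_babysitter_orchestrator_hard_failover_recovery_and_rebalance:537] "
    "Doing a delta recovery"
)


def parse_log_time(line):
    """Return the leading timestamp of a log line as a datetime."""
    stamp = line.split(" | ", 1)[0].strip()
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def elapsed_seconds(earlier_line, later_line):
    """Seconds between the timestamps of two log lines."""
    delta = parse_log_time(later_line) - parse_log_time(earlier_line)
    return delta.total_seconds()


gap = elapsed_seconds(START_SERVER, DELTA_RECOVERY)
print(f"{gap:.3f} s between server start and delta recovery")
```

For these two lines the gap is a little over 12 minutes between starting couchbase-server and beginning the delta recovery.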

6. Repeat the above step; this time the orchestrator is 172.23.106.233.
This time, however, the rebalance fails as shown below:

Rebalance exited with reason {prepare_rebalance_failed,
                              {error,
                               {failed_nodes,
                                [{'ns_1@172.23.106.233',{error,timeout}}]}}}.

Rebalance Operation Id = b2c5b4fea1b0b74c08a0789f1f73d073
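When comparing repeated runs, the failing node and its error can be pulled out of the exit reason automatically. This is a minimal Python sketch, assuming the reason is formatted as in the message above; `failed_nodes` is a hypothetical helper, and the regular expression only covers this `{Node, {error, Reason}}` shape.

```python
import re

# Matches entries such as {'ns_1@172.23.106.233',{error,timeout}}.
NODE_ERROR = re.compile(r"\{'([^']+)',\s*\{error,\s*(\w+)\}\}")

# Exit reason copied from this report, joined onto one line.
REASON = (
    "{prepare_rebalance_failed, {error, {failed_nodes, "
    "[{'ns_1@172.23.106.233',{error,timeout}}]}}}"
)


def failed_nodes(reason):
    """Return a list of (node, error) pairs found in a rebalance exit reason."""
    return NODE_ERROR.findall(reason)


for node, error in failed_nodes(REASON):
    print(f"{node}: {error}")
```

Here the only failed node is the orchestrator from the second iteration, `ns_1@172.23.106.233`, with error `timeout`.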

I have run this test about 10 times, and every time it failed on the second iteration of the babysitter-kill step (step 5).

cbcollect_info attached.


People

Assignee: Balakumaran Gopal

Reporter: Balakumaran Gopal

Votes: 0

Watchers: 4

Dates

Created: 12/Oct/20 9:50 AM

Updated: 06/Nov/20 12:03 AM

Resolved: 26/Oct/20 9:44 AM


Killing babysitter on orchestrator + hard failover + delta recovery + rebalance fails with prepare_rebalance_failed