Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Version: 7.1.0
- Build: 7.1.0-1036-enterprise
- Triage: Untriaged
- Operating System: Centos 64-bit
- 1
- No
Description
Script to Repro
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops.ini rerun=False,get-cbcollect-info=True,quota_percent=95,crash_warning=True,bucket_storage=magma,enable_dp=True,retry_get_process_num=200 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_graceful_failover_rebalance_out,nodes_init=5,nodes_failover=2,bucket_spec=multi_bucket.buckets_for_rebalance_tests,data_load_stage=after,scrape_interval=2,rebalance_moves_per_node=64,skip_validations=False,GROUP=P0_failover_and_rebalance_out'
Steps to Repro
1. Create a 5-node cluster (a REST sketch of this step follows the table below)
2021-06-23 21:09:04,476 | test | INFO | pool-5-thread-6 | [table_view:display:72] Rebalance Overview
---------------------------------------------------------------------------------
Nodes          | Services | Version               | CPU           | Status       |
---------------------------------------------------------------------------------
172.23.98.196  | kv       | 7.1.0-1036-enterprise | 5.39450466347 | Cluster node |
172.23.98.195  | None     |                       |               | <--- IN ---  |
172.23.121.10  | None     |                       |               | <--- IN ---  |
172.23.104.186 | None     |                       |               | <--- IN ---  |
172.23.120.201 | None     |                       |               | <--- IN ---  |
---------------------------------------------------------------------------------
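For anyone reproducing step 1 outside the testrunner framework, it boils down to standard ns_server REST calls: addNode for each new node, then a rebalance. This is only a sketch; the admin credentials are placeholders and the requests-based code is an assumption, not part of the original test. Node IPs are taken from the table above.

# Sketch of step 1: add four nodes to the first one and rebalance them in.
import requests

ADMIN = ("Administrator", "password")          # placeholder credentials
ORCHESTRATOR = "http://172.23.98.196:8091"
NEW_NODES = ["172.23.98.195", "172.23.121.10",
             "172.23.104.186", "172.23.120.201"]

for ip in NEW_NODES:
    requests.post(ORCHESTRATOR + "/controller/addNode",
                  data={"hostname": ip, "user": ADMIN[0],
                        "password": ADMIN[1], "services": "kv"},
                  auth=ADMIN).raise_for_status()

# Rebalance the newly added nodes in (knownNodes takes otpNode names).
nodes = requests.get(ORCHESTRATOR + "/pools/default", auth=ADMIN).json()["nodes"]
requests.post(ORCHESTRATOR + "/controller/rebalance",
              data={"knownNodes": ",".join(n["otpNode"] for n in nodes),
                    "ejectedNodes": ""},
              auth=ADMIN).raise_for_status()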
2. Create bucket/scopes/collections/data (a REST sketch follows the bucket statistics below)
2021-06-23 21:14:21,836 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Bucket | Type | Replicas | Durability | TTL | Items | RAM Quota | RAM Used | Disk Used |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
4yZIQSEJJD6IzNnFF4kWkPgVRxWd0tC-1DBmRsF05LD1MVBbaPNwgg2vNjO8SCRd_syOC9QCrPVlNpP_x-15-414000 | couchbase | 2 | none | 0 | 3000000 | 10485760000 | 2613283152 | 8185163251 |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
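Step 2 is driven by the bucket_spec in the test, but the equivalent REST calls are straightforward. A sketch under the assumption of a single magma bucket with 2 replicas and one scope/collection; the bucket, scope and collection names below are placeholders (the real test generates the long bucket name shown in the table above and many scopes/collections), and the credentials are assumptions.

# Sketch of step 2: create a magma bucket, a scope and a collection.
import requests

ADMIN = ("Administrator", "password")          # placeholder credentials
BASE = "http://172.23.98.196:8091"
BUCKET = "bucket-1"                            # placeholder name

# couchbase bucket, 2 replicas, magma storage backend, ~10000 MB quota
# (matching the Replicas/RAM Quota columns in the table above).
requests.post(BASE + "/pools/default/buckets",
              data={"name": BUCKET, "bucketType": "couchbase",
                    "ramQuotaMB": 10000, "replicaNumber": 2,
                    "storageBackend": "magma"},
              auth=ADMIN).raise_for_status()

# One scope with one collection; the real spec creates many of each.
requests.post(BASE + "/pools/default/buckets/%s/scopes" % BUCKET,
              data={"name": "scope_1"}, auth=ADMIN).raise_for_status()
requests.post(BASE + "/pools/default/buckets/%s/scopes/scope_1/collections" % BUCKET,
              data={"name": "collection_1"}, auth=ADMIN).raise_for_status()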
3. Change a few settings and gracefully fail over 2 nodes (172.23.104.186 and 172.23.120.201); a REST sketch follows the log lines below
2021-06-23 21:14:29,920 | test | INFO | MainThread | [collections_rebalance:setUp:59] Changing scrape interval to 2
2021-06-23 21:14:32,463 | test | INFO | MainThread | [cluster_ready_functions:set_rebalance_moves_per_nodes:129] Changed Rebalance settings: {u'rebalanceMovesPerNode': 64}
2021-06-23 21:14:32,464 | test | INFO | MainThread | [collections_rebalance:load_collections_with_rebalance:932] Doing collection data load after graceful_failover_rebalance_out
2021-06-23 21:14:32,466 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:390] Starting rebalance operation of type : graceful_failover_rebalance_out
2021-06-23 21:14:32,467 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:602] failing over nodes [ip:172.23.104.186 port:8091 ssh_username:root, ip:172.23.120.201 port:8091 ssh_username:root]
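The two settings changes and the failover in step 3 also map to well-known REST endpoints. A sketch assuming placeholder credentials; the scrape-interval change is framework-specific and is omitted here, and whether both nodes are failed over in one call or one after the other is a detail of the framework.

# Sketch of step 3: set rebalanceMovesPerNode=64, then start a graceful
# failover of the two nodes (otpNode names are of the form ns_1@<ip>).
import requests

ADMIN = ("Administrator", "password")          # placeholder credentials
BASE = "http://172.23.98.196:8091"

# Matches the {u'rebalanceMovesPerNode': 64} change logged above.
requests.post(BASE + "/settings/rebalance",
              data={"rebalanceMovesPerNode": 64},
              auth=ADMIN).raise_for_status()

# Graceful failover of both nodes in a single request.
requests.post(BASE + "/controller/startGracefulFailover",
              data=[("otpNode", "ns_1@172.23.104.186"),
                    ("otpNode", "ns_1@172.23.120.201")],
              auth=ADMIN).raise_for_status()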
4. Wait for graceful failover to complete (a polling sketch follows the warnings below)
2021-06-23 21:21:03,559 | test | WARNING | MainThread | [rest_client:get_nodes:1756] 172.23.104.186 - Node not part of cluster inactiveFailed
2021-06-23 21:21:03,561 | test | WARNING | MainThread | [rest_client:get_nodes:1756] 172.23.120.201 - Node not part of cluster inactiveFailed
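Step 4 is just a wait loop. A sketch of how the "inactiveFailed" state from the warnings above can be polled for, again with placeholder credentials:

# Sketch of step 4: wait until both nodes report clusterMembership
# "inactiveFailed", i.e. the graceful failover has completed.
import time
import requests

ADMIN = ("Administrator", "password")          # placeholder credentials
BASE = "http://172.23.98.196:8091"
FAILED = {"172.23.104.186", "172.23.120.201"}

def failover_done():
    nodes = requests.get(BASE + "/pools/default", auth=ADMIN).json()["nodes"]
    states = {n["hostname"].split(":")[0]: n["clusterMembership"] for n in nodes}
    return all(states.get(ip) == "inactiveFailed" for ip in FAILED)

while not failover_done():
    time.sleep(10)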
5. Rebalance out the failed-over nodes (a REST sketch follows the table below)
2021-06-23 21:21:59,946 | test | INFO | pool-5-thread-11 | [table_view:display:72] Rebalance Overview
-----------------------------------------------------------------------------------
Nodes          | Services | Version               | CPU            | Status       |
-----------------------------------------------------------------------------------
172.23.98.196  | kv       | 7.1.0-1036-enterprise | 10.5732484076  | Cluster node |
172.23.98.195  | kv       | 7.1.0-1036-enterprise | 24.7787610619  | Cluster node |
172.23.104.186 | kv       | 7.1.0-1036-enterprise | 1.00755667506  | --- OUT ---> |
172.23.120.201 | kv       | 7.1.0-1036-enterprise | 0.752823086575 | --- OUT ---> |
172.23.121.10  | kv       | 7.1.0-1036-enterprise | 12.2474747475  | Cluster node |
-----------------------------------------------------------------------------------
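Step 5 is the rebalance that ultimately fails: the two failed-over nodes are passed as ejectedNodes. A sketch, with the same placeholder credentials as above:

# Sketch of step 5: rebalance the two failed-over nodes out.
import requests

ADMIN = ("Administrator", "password")          # placeholder credentials
BASE = "http://172.23.98.196:8091"
EJECT = {"172.23.104.186", "172.23.120.201"}

nodes = requests.get(BASE + "/pools/default", auth=ADMIN).json()["nodes"]
known = ",".join(n["otpNode"] for n in nodes)
ejected = ",".join(n["otpNode"] for n in nodes
                   if n["hostname"].split(":")[0] in EJECT)

requests.post(BASE + "/controller/rebalance",
              data={"knownNodes": known, "ejectedNodes": ejected},
              auth=ADMIN).raise_for_status()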
6. Rebalance fails as shown in rebalance_failure.txt (a sketch for watching the rebalance task follows below).
cbcollect_info is attached.
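The failure itself can be observed through the tasks endpoint rather than the framework. A sketch only; the exact fields attached to a failed rebalance task vary between builds, so the code just dumps the whole task (the actual error for this run is in rebalance_failure.txt).

# Sketch of step 6: watch the rebalance task until it stops running.
import time
import requests

ADMIN = ("Administrator", "password")          # placeholder credentials
BASE = "http://172.23.98.196:8091"

while True:
    tasks = requests.get(BASE + "/pools/default/tasks", auth=ADMIN).json()
    rebalance = next(t for t in tasks if t["type"] == "rebalance")
    if rebalance["status"] != "running":
        print(rebalance)                       # failed/finished task details
        break
    time.sleep(10)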
In case someone is interested in assessing the blast radius: the circumstances under which this can happen are fairly random. First, a log line needs to exceed 50K; the most likely place for that is during rebalance, where we print the vbmap, which is usually large, especially with long bucket names and/or a large number of nodes. Then the log line truncation needs to land right on an element of "map" data type. I'd guesstimate that the number of customers upgrading to 7.0.0 in production with this kind of setup is not high.
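To make that failure mode concrete: log lines are truncated at 50K, and a vbucket map rendered into a single line can easily exceed that, so the cut can land in the middle of a map element. The snippet below is only an illustration of that general hazard using JSON with made-up sizes (the real code formats Erlang terms), not Couchbase code.

# Illustration: truncating a >50K structured log line mid-element leaves a
# fragment that can no longer be parsed.
import json

LIMIT = 50 * 1024                              # the 50K truncation threshold
vbmap = {"vb_%d" % vb: ["active_node_with_a_long_bucket_name_%d" % (vb % 5),
                        "replica_node_%d" % ((vb + 1) % 5)]
         for vb in range(1024)}

line = "rebalance: vbucket map: " + json.dumps(vbmap)
print(len(line) > LIMIT)                       # True: well over 50K

truncated = line[:LIMIT] + "...<truncated>"
try:
    json.loads(truncated[len("rebalance: vbucket map: "):])
except ValueError as exc:
    print("map fragment no longer parses:", exc)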