Details
-
Bug
-
Resolution: Fixed
-
Critical
-
7.1.0
-
7.1.0-1066-enterprise
-
Triaged
-
Centos 64-bit
-
1
-
No
Description
Script to Repro
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops.ini rerun=False,get-cbcollect-info=True,quota_percent=95,crash_warning=True,bucket_storage=magma,enable_dp=True,retry_get_process_num=200 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_graceful_failover_recovery,nodes_init=5,nodes_failover=2,recovery_type=delta,step_count=1,bucket_spec=multi_bucket.buckets_for_rebalance_tests,data_load_stage=during,skip_validations=False'
|
Steps to Repro
1. Create a 5 node cluster
2021-07-13 15:05:12,207 | test | INFO | pool-3-thread-6 | [table_view:display:72] Rebalance Overview
----------------------------------------------------------------------
Nodes | Services | Version | CPU | Status |
----------------------------------------------------------------------
172.23.98.196 | kv | 7.1.0-1066-enterprise | 1.73779708663 | Cluster node |
172.23.98.195 | None | <--- IN — | ||
172.23.120.206 | None | <--- IN — | ||
172.23.104.186 | None | <--- IN — | ||
172.23.120.201 | None | <--- IN — |
----------------------------------------------------------------------
2. Create bucket/scopes/collections/data
2021-07-13 15:10:04,253 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
--------------------------------------------------------------------------------------------------------------------------------------------------+
Bucket | Type | Storage Backend | Replicas | Durability | TTL | Items | RAM Quota | RAM Used | Disk Used |
--------------------------------------------------------------------------------------------------------------------------------------------------+
r4dK9wYeZBZ%iqABnPUVxh6g5r2Sj4CAH8N89or9vhAsV3Sv3AK-22-897000 | couchbase | magma | 2 | none | 0 | 3000000 | 10485760000 | 2407566136 | 1992997674 |
--------------------------------------------------------------------------------------------------------------------------------------------------+
3. Do a graceful failover/delta recovery/rebalance of node 172.23.104.186.
2021-07-13 15:10:11,358 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:696] failing over nodes [ip:172.23.104.186 port:8091 ssh_username:root]
|
2021-07-13 15:10:32,812 | test | INFO | pool-3-thread-5 | [rest_client:monitorRebalance:1483] Rebalance done. Taken 21.2660000324 seconds to complete
|
2021-07-13 15:15:16,674 | test | INFO | pool-3-thread-2 | [task:check:414] Rebalance completed with progress: 100% in 62.2669999599 sec
|
4. Do a graceful failover/delta recovery/rebalance of another node 172.23.120.201.
2021-07-13 15:15:16,676 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:696] failing over nodes [ip:172.23.120.201 port:8091 ssh_username:root]
|
2021-07-13 15:15:38,038 | test | INFO | pool-3-thread-24 | [rest_client:monitorRebalance:1483] Rebalance done. Taken 21.1840000153 seconds to complete
|
This rebalance fails as shown below.
2021-07-13 15:21:03,512 | test | ERROR | pool-3-thread-18 | [rest_client:_rebalance_status_and_progress:1547] {u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.', u'type': u'rebalance', u'masterRequestTimedOut': False, u'statusId': u'151552f47e4c4a05b5d47463a51418e7', u'statusIsStale': False, u'lastReportURI': u'/logs/rebalanceReport?reportID=23826d0fd459189b7a9d51bd2bdcbdf4', u'status': u'notRunning'} - rebalance failed
|
2021-07-13 15:21:03,584 | test | INFO | pool-3-thread-18 | [rest_client:print_UI_logs:2688] Latest logs from UI on 172.23.98.196:
|
2021-07-13 15:21:03,585 | test | ERROR | pool-3-thread-18 | [rest_client:print_UI_logs:2690] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.98.196', u'tstamp': 1626214862301L, u'shortText': u'message', u'serverTime': u'2021-07-13T15:21:02.301Z', u'text': u'Rebalance exited with reason {prepare_delta_recovery_failed,\n "r4dK9wYeZBZ%iqABnPUVxh6g5r2Sj4CAH8N89or9vhAsV3Sv3AK-22-897000",\n {error,\n {failed_nodes,\n [{\'ns_1@172.23.120.201\',{error,aborted}}]}}}.\nRebalance Operation Id = 7737f8e895d77ae0df92f798cf6024d5'}
|
2021-07-13 15:21:03,585 | test | ERROR | pool-3-thread-18 | [rest_client:print_UI_logs:2690] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'info', u'node': u'ns_1@172.23.98.196', u'tstamp': 1626214741965L, u'shortText': u'message', u'serverTime': u'2021-07-13T15:19:01.965Z', u'text': u"Starting rebalance, KeepNodes = ['ns_1@172.23.98.196','ns_1@172.23.98.195',\n 'ns_1@172.23.104.186','ns_1@172.23.120.201',\n 'ns_1@172.23.120.206'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.120.201'], Delta recovery buckets = all; Operation Id = 7737f8e895d77ae0df92f798cf6024d5"}
|
2021-07-13 15:21:03,585 | test | ERROR | pool-3-thread-18 | [rest_client:print_UI_logs:2690] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'info', u'node': u'ns_1@172.23.98.196', u'tstamp': 1626214536289L, u'shortText': u'message', u'serverTime': u'2021-07-13T15:15:36.289Z', u'text': u'Graceful failover completed successfully.\nRebalance Operation Id = e4fae68cab8c603c2e17630ae69492cb'}
|
2021-07-13 15:21:03,585 | test | ERROR | pool-3-thread-18 | [rest_client:print_UI_logs:2690] {u'code': 0, u'module': u'failover', u'type': u'info', u'node': u'ns_1@172.23.98.196', u'tstamp': 1626214536037L, u'shortText': u'message', u'serverTime': u'2021-07-13T15:15:36.037Z', u'text': u"Deactivating failed over nodes ['ns_1@172.23.120.201']"}
|
2021-07-13 15:21:03,586 | test | ERROR | pool-3-thread-18 | [rest_client:print_UI_logs:2690] {u'code': 0, u'module': u'failover', u'type': u'info', u'node': u'ns_1@172.23.98.196', u'tstamp': 1626214536017L, u'shortText': u'message', u'serverTime': u'2021-07-13T15:15:36.017Z', u'text': u"Failed over ['ns_1@172.23.120.201']: ok"}
|
2021-07-13 15:21:03,586 | test | ERROR | pool-3-thread-18 | [rest_client:print_UI_logs:2690] {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.120.201', u'tstamp': 1626214535972L, u'shortText': u'message', u'serverTime': u'2021-07-13T15:15:35.972Z', u'text': u'Shutting down bucket "r4dK9wYeZBZ%iqABnPUVxh6g5r2Sj4CAH8N89or9vhAsV3Sv3AK-22-897000" on \'ns_1@172.23.120.201\' for deletion'}
|
2021-07-13 15:21:03,586 | test | ERROR | pool-3-thread-18 | [rest_client:print_UI_logs:2690] {u'code': 0, u'module': u'failover', u'type': u'info', u'node': u'ns_1@172.23.98.196', u'tstamp': 1626214535699L, u'shortText': u'message', u'serverTime': u'2021-07-13T15:15:35.699Z', u'text': u"Starting failing over ['ns_1@172.23.120.201']"}
|
2021-07-13 15:21:03,588 | test | ERROR | pool-3-thread-18 | [rest_client:print_UI_logs:2690] {u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'info', u'node': u'ns_1@172.23.98.196', u'tstamp': 1626214517078L, u'shortText': u'message', u'serverTime': u'2021-07-13T15:15:17.078Z', u'text': u'Bucket "r4dK9wYeZBZ%iqABnPUVxh6g5r2Sj4CAH8N89or9vhAsV3Sv3AK-22-897000" rebalance does not seem to be swap rebalance'}
|
2021-07-13 15:21:03,588 | test | ERROR | pool-3-thread-18 | [rest_client:print_UI_logs:2690] {u'code': 0, u'module': u'ns_rebalancer', u'type': u'info', u'node': u'ns_1@172.23.98.196', u'tstamp': 1626214516793L, u'shortText': u'message', u'serverTime': u'2021-07-13T15:15:16.793Z', u'text': u"Starting vbucket moves for graceful failover of ['ns_1@172.23.120.201']"}
|
2021-07-13 15:21:03,588 | test | ERROR | pool-3-thread-18 | [rest_client:print_UI_logs:2690] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'info', u'node': u'ns_1@172.23.98.196', u'tstamp': 1626214516793L, u'shortText': u'message', u'serverTime': u'2021-07-13T15:15:16.793Z', u'text': u"Starting graceful failover of nodes ['ns_1@172.23.120.201']. Operation Id = e4fae68cab8c603c2e17630ae69492cb"}
|
2021-07-13 15:21:03,588 | test | ERROR | pool-3-thread-18 | [task:call:267] Rebalance Failed: {u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.', u'type': u'rebalance', u'masterRequestTimedOut': False, u'statusId': u'151552f47e4c4a05b5d47463a51418e7', u'statusIsStale': False, u'lastReportURI': u'/logs/rebalanceReport?reportID=23826d0fd459189b7a9d51bd2bdcbdf4', u'status': u'notRunning'} - rebalance failed
|
Some interesting things to note.
1. This is fairly easily reproducible but not on every single run. At least 3/10 times we can hit this from 10-15 runs I have had with each run having 15-20 tests.
2. This is only seen on the second graceful failover/delta recovery/rebalance(step 4) and when it does on first graceful failover/delta recovery/rebalance(step 3) it fails like MB-47087.
cbcollect_info attached.
Attachments
Issue Links
For Gerrit Dashboard: MB-47387 | ||||||
---|---|---|---|---|---|---|
# | Subject | Branch | Project | Status | CR | V |
158683,3 | MB-47387 Increase magma bucket creation timeout | master | ns_server | Status: MERGED | +2 | +1 |
161551,2 | Revert "MB-47387 Increase magma bucket creation timeout" | master | ns_server | Status: ABANDONED | -2 | +1 |
163886,2 | MB-47387: Make ep_warmup_thread indicate all threads finished | master | kv_engine | Status: MERGED | +2 | +1 |
165417,4 | MB-47387 Increase timeout for magma | master | ns_server | Status: MERGED | +2 | +1 |
165964,3 | MB-47387 Increase ns_server->memcached timeout | master | ns_server | Status: MERGED | +2 | +1 |
168522,4 | MB-47387 lsm: Use all thread pools to open tables on recovery | master | magma | Status: MERGED | +2 | +1 |
168602,6 | MB-47387 lsm: Fix tracking of io stats during sstable open | master | magma | Status: MERGED | +2 | +1 |
168603,5 | MB-47387 magma: Enhance db recovery log message | master | magma | Status: MERGED | +2 | +1 |