Couchbase Server / MB-62079

Rebalance failed during CPU stress test


Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • Morpheus, 7.6.2
    • 7.6.2
    • secondary-index
    • None
    • Untriaged
    • 0
    • Yes

    Description

      Build - 7.6.2-3674

      Steps 

      • Cluster config - kv:n1ql-kv:n1ql-index-index-index-index-index
      • Create a bucket and one named keyspace, load docs, and create indexes in all the keyspaces
      • Keep running index scans in the background
      • Load docs until indexer resident ratio reaches 20%
      • Fill the disk up to 80% capacity on all indexer nodes. The command below is used:

        dd if=/dev/mapper/tmpl--deb10--vg-root of=/opt/couchbase/var/lib/couchbase/data/DUMMY_FILE_DELETE_IF_STILL_PRESENT bs=1M  

         

      • Rebalance out 2 indexer nodes
      • During the rebalance, induce CPU and memory stress on all the nodes in the cluster using the command below:

      stress --cpu 1 --vm-bytes 365M --vm 1 --timeout 1800 -d 1 & > /dev/null && echo 1 || echo 0 

      • The ongoing rebalance fails with the error below:

      {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2024-05-25 13:16:27,573] - [on_prem_rest_client:4324] INFO - Latest logs from UI on 172.23.123.48:
      [2024-05-25 13:16:27,573] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.122.61', 'type': 'info', 'code': 0, 'module': 'menelaus_web_alerts_srv', 'tstamp': 1716668185410, 'shortText': 'message', 'text': "The time on node 'ns_1@172.23.122.61' is not synchronized. Please ensure that NTP is set up correctly on all nodes and that clocks are synchronized.", 'serverTime': '2024-05-25T13:16:25.410Z'}
      [2024-05-25 13:16:27,573] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.120.101', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1716668180054, 'shortText': 'message', 'text': 'Rebalance exited with reason {{badmatch,\n                               {leader_activities_error,\n                                {default,rebalance},\n                                {quorum_lost,\n                                 {lease_lost,\'ns_1@172.23.121.135\'}}}},\n                              [{ns_rebalancer,rebalance,7,\n                                [{file,"src/ns_rebalancer.erl"},{line,456}]},\n                               {proc_lib,init_p_do_apply,3,\n                                [{file,"proc_lib.erl"},{line,240}]}]}.\nRebalance Operation Id = 4c14e220ff46693203c2da33c8b8697d', 'serverTime': '2024-05-25T13:16:20.054Z'}
      [2024-05-25 13:16:27,574] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.121.160', 'type': 'info', 'code': 0, 'module': 'menelaus_web_alerts_srv', 'tstamp': 1716668175242, 'shortText': 'message', 'text': 'Warning: approaching low index resident percentage. Indexer RAM percentage on node "172.23.121.160" is 9%, which is under the threshold of 10%.', 'serverTime': '2024-05-25T13:16:15.242Z'}
      [2024-05-25 13:16:27,574] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.121.160', 'type': 'info', 'code': 0, 'module': 'menelaus_web_alerts_srv', 'tstamp': 1716668175241, 'shortText': 'message', 'text': "The time on node 'ns_1@172.23.121.160' is not synchronized. Please ensure that NTP is set up correctly on all nodes and that clocks are synchronized.", 'serverTime': '2024-05-25T13:16:15.241Z'}
      [2024-05-25 13:16:27,574] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.120.101', 'type': 'info', 'code': 0, 'module': 'ns_vbucket_mover', 'tstamp': 1716668174184, 'shortText': 'message', 'text': 'Bucket "test_bucket" rebalance appears to be swap rebalance', 'serverTime': '2024-05-25T13:16:14.184Z'}
      [2024-05-25 13:16:27,574] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.120.101', 'type': 'info', 'code': 0, 'module': 'menelaus_web_alerts_srv', 'tstamp': 1716668173811, 'shortText': 'message', 'text': 'Warning: approaching low index resident percentage. Indexer RAM percentage on node "172.23.120.101" is 0%, which is under the threshold of 10%.', 'serverTime': '2024-05-25T13:16:13.811Z'}
      [2024-05-25 13:16:27,574] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.122.123', 'type': 'info', 'code': 0, 'module': 'menelaus_web_alerts_srv', 'tstamp': 1716668160746, 'shortText': 'message', 'text': "The time on node 'ns_1@172.23.122.123' is not synchronized. Please ensure that NTP is set up correctly on all nodes and that clocks are synchronized.", 'serverTime': '2024-05-25T13:16:00.746Z'}
      [2024-05-25 13:16:27,574] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.121.66', 'type': 'info', 'code': 0, 'module': 'menelaus_web_alerts_srv', 'tstamp': 1716668150283, 'shortText': 'message', 'text': 'Warning: approaching low index resident percentage. Indexer RAM percentage on node "172.23.121.66" is 4%, which is under the threshold of 10%.', 'serverTime': '2024-05-25T13:15:50.283Z'}
      [2024-05-25 13:16:27,574] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.121.66', 'type': 'info', 'code': 0, 'module': 'menelaus_web_alerts_srv', 'tstamp': 1716668150282, 'shortText': 'message', 'text': "The time on node 'ns_1@172.23.121.66' is not synchronized. Please ensure that NTP is set up correctly on all nodes and that clocks are synchronized.", 'serverTime': '2024-05-25T13:15:50.282Z'}
      [2024-05-25 13:16:27,574] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.120.101', 'type': 'info', 'code': 0, 'module': 'ns_rebalancer', 'tstamp': 1716668149423, 'shortText': 'message', 'text': 'Started rebalancing bucket test_bucket', 'serverTime': '2024-05-25T13:15:49.423Z'}
      [2024-05-25 13:16:27,576] - [remote_util:306] INFO - SSH Connecting to 172.23.120.101 with username:root, attempt#1 of 5
      [2024-05-25 13:16:27,848] - [remote_util:344] INFO - SSH Connected to 172.23.120.101 as root
      [2024-05-25 13:16:27,988] - [remote_util:3520] INFO - os_distro: Ubuntu, os_version: debian 10, is_linux_distro: True
      [2024-05-25 13:16:28,284] - [remote_util:3690] INFO - extract_remote_info-->distribution_type: Ubuntu, distribution_version: debian 10
      [2024-05-25 13:16:28,285] - [remote_util:3356] INFO - running command.raw on 172.23.120.101: rm -f /opt/couchbase/var/lib/couchbase/data/DUMMY_FILE_DELETE_IF_STILL_PRESENT
      [2024-05-25 13:16:31,179] - [on_prem_rest_client:2078] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2024-05-25 13:16:31,192] - [on_prem_rest_client:4324] INFO - Latest logs from UI on 172.23.123.48:
      [2024-05-25 13:16:31,192] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.122.61', 'type': 'info', 'code': 0, 'module': 'menelaus_web_alerts_srv', 'tstamp': 1716668185410, 'shortText': 'message', 'text': "The time on node 'ns_1@172.23.122.61' is not synchronized. Please ensure that NTP is set up correctly on all nodes and that clocks are synchronized.", 'serverTime': '2024-05-25T13:16:25.410Z'}
      [2024-05-25 13:16:31,192] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.120.101', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1716668180054, 'shortText': 'message', 'text': 'Rebalance exited with reason {{badmatch,\n                               {leader_activities_error,\n                                {default,rebalance},\n                                {quorum_lost,\n                                 {lease_lost,\'ns_1@172.23.121.135\'}}}},\n                              [{ns_rebalancer,rebalance,7,\n                                [{file,"src/ns_rebalancer.erl"},{line,456}]},\n                               {proc_lib,init_p_do_apply,3,\n                                [{file,"proc_lib.erl"},{line,240}]}]}.\nRebalance Operation Id = 4c14e220ff46693203c2da33c8b8697d', 'serverTime': '2024-05-25T13:16:20.054Z'}
      [2024-05-25 13:16:31,192] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.121.160', 'type': 'info', 'code': 0, 'module': 'menelaus_web_alerts_srv', 'tstamp': 1716668175242, 'shortText': 'message', 'text': 'Warning: approaching low index resident percentage. Indexer RAM percentage on node "172.23.121.160" is 9%, which is under the threshold of 10%.', 'serverTime': '2024-05-25T13:16:15.242Z'}
      [2024-05-25 13:16:31,192] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.121.160', 'type': 'info', 'code': 0, 'module': 'menelaus_web_alerts_srv', 'tstamp': 1716668175241, 'shortText': 'message', 'text': "The time on node 'ns_1@172.23.121.160' is not synchronized. Please ensure that NTP is set up correctly on all nodes and that clocks are synchronized.", 'serverTime': '2024-05-25T13:16:15.241Z'}
      [2024-05-25 13:16:31,193] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.120.101', 'type': 'info', 'code': 0, 'module': 'ns_vbucket_mover', 'tstamp': 1716668174184, 'shortText': 'message', 'text': 'Bucket "test_bucket" rebalance appears to be swap rebalance', 'serverTime': '2024-05-25T13:16:14.184Z'}
      [2024-05-25 13:16:31,193] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.120.101', 'type': 'info', 'code': 0, 'module': 'menelaus_web_alerts_srv', 'tstamp': 1716668173811, 'shortText': 'message', 'text': 'Warning: approaching low index resident percentage. Indexer RAM percentage on node "172.23.120.101" is 0%, which is under the threshold of 10%.', 'serverTime': '2024-05-25T13:16:13.811Z'}
      [2024-05-25 13:16:31,193] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.122.123', 'type': 'info', 'code': 0, 'module': 'menelaus_web_alerts_srv', 'tstamp': 1716668160746, 'shortText': 'message', 'text': "The time on node 'ns_1@172.23.122.123' is not synchronized. Please ensure that NTP is set up correctly on all nodes and that clocks are synchronized.", 'serverTime': '2024-05-25T13:16:00.746Z'}
      [2024-05-25 13:16:31,193] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.121.66', 'type': 'info', 'code': 0, 'module': 'menelaus_web_alerts_srv', 'tstamp': 1716668150283, 'shortText': 'message', 'text': 'Warning: approaching low index resident percentage. Indexer RAM percentage on node "172.23.121.66" is 4%, which is under the threshold of 10%.', 'serverTime': '2024-05-25T13:15:50.283Z'}
      [2024-05-25 13:16:31,193] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.121.66', 'type': 'info', 'code': 0, 'module': 'menelaus_web_alerts_srv', 'tstamp': 1716668150282, 'shortText': 'message', 'text': "The time on node 'ns_1@172.23.121.66' is not synchronized. Please ensure that NTP is set up correctly on all nodes and that clocks are synchronized.", 'serverTime': '2024-05-25T13:15:50.282Z'}
      [2024-05-25 13:16:31,193] - [on_prem_rest_client:4325] ERROR - {'node': 'ns_1@172.23.120.101', 'type': 'info', 'code': 0, 'module': 'ns_rebalancer', 'tstamp': 1716668149423, 'shortText': 'message', 'text': 'Started rebalancing bucket test_bucket', 'serverTime': '2024-05-25T13:15:49.423Z'} 
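
      The test framework logs every UI log entry at ERROR level, including entries whose own 'type' is 'info', and it dumps the same entries twice, so the one 'critical' entry from ns_orchestrator is easy to miss. A minimal sketch for pulling it out of a saved copy of the output above follows. The function names are hypothetical, and the line format is assumed to match the lines shown in this ticket (a bracketed timestamp, a bracketed source location, a level, then a Python dict literal).

```python
import ast
import re

# Assumed line shape, taken from the log lines in this ticket:
# "[<timestamp>] - [<module>:<line>] <LEVEL> - {<python dict literal>}"
LINE_RE = re.compile(r"^\s*\[[^\]]+\] - \[[^\]]+\] \w+ - (\{.*\})\s*$")


def parse_ui_log_entries(lines):
    """Return the unique UI log entries (as dicts) found in framework log lines."""
    seen = set()
    entries = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a dict payload, or payload followed by extra text
        try:
            entry = ast.literal_eval(match.group(1))
        except (ValueError, SyntaxError):
            continue  # payload is not a valid Python literal
        if not isinstance(entry, dict) or "tstamp" not in entry:
            continue
        # The framework prints the same UI entries more than once; keep one copy.
        key = (entry.get("node"), entry.get("module"),
               entry.get("tstamp"), entry.get("text"))
        if key in seen:
            continue
        seen.add(key)
        entries.append(entry)
    return entries


def critical_entries(entries):
    """Return entries whose own 'type' field is 'critical', oldest first."""
    critical = [e for e in entries if e.get("type") == "critical"]
    return sorted(critical, key=lambda e: e["tstamp"])
```

      Applied to the output above, the only entry this filter should keep is the ns_orchestrator one reporting quorum_lost / lease_lost for 'ns_1@172.23.121.135'. The NTP and resident-percentage warnings are 'info' entries.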

      Logs 

      test_2 (12).zip

      Let me know if I can tweak the load for disk filling.
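
      On tuning the disk-fill step: the dd command in the steps has no explicit size limit, so one option is to compute how many 1 MiB blocks are needed to reach the 80% target and pass that number as dd's count= operand. The sketch below is only an illustration under that assumption. The function names are hypothetical, and the used/total ratio from shutil.disk_usage can differ slightly from the percentage df prints on filesystems with reserved blocks.

```python
import math
import shutil

MIB = 1024 * 1024


def blocks_to_reach_usage(total_bytes, used_bytes, target_pct, block_bytes=MIB):
    """Blocks to write so used/total reaches target_pct (0 if already there)."""
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    deficit = total_bytes * target_pct / 100 - used_bytes
    if deficit <= 0:
        return 0
    return math.ceil(deficit / block_bytes)


def dd_count_for_path(path, target_pct=80):
    """Block count (bs=1M) needed on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return blocks_to_reach_usage(usage.total, usage.used, target_pct)
```

      For example, dd_count_for_path("/opt/couchbase/var/lib/couchbase/data") would give the value to use as count= alongside bs=1M on each indexer node.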


          People

            yash.dodderi Yash Dodderi
            yash.dodderi Yash Dodderi