  1. Couchbase Server
  2. MB-34847

Swap rebalance failed in high bucket density test

Details

    • Untriaged
    • Yes

    Description

      Build 6.5.0-3633

      Observed that the second swap rebalance for KV failed while running the high bucket density (30 buckets) test.

      Node 172.23.97.14 was coming in and 172.23.97.15 was going out.

      Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/310/

      Logs-

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-310/172.23.97.14.zip 
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-310/172.23.97.15.zip

      In the logs we see this error:

      Node ('ns_1@172.23.97.14') was automatically failed over. Reason: The data service is online but the following buckets' data are not accessible: bucket-19.

          Activity

            Closing duplicate bugs

            arunkumar Arunkumar Senthilnathan added a comment

            Tried with an auto-failover timeout of 10 seconds; the failure was not observed.

            Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/318

            We also do not see high CPU utilisation - http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=arke_basic_650-4059_run_kv_rebalance_a906#487de8ca215c7d1aa3798ec6ce191f4a

            mahesh.mandhare Mahesh Mandhare (Inactive) added a comment - edited

            Mahesh Mandhare: I also noticed this warning:

            [WARNING] Unexpected exception in NSServerSystem: 'cpu_utilization_rate'

            Not sure whether CPU utilization is high on the orchestrator and/or the node being added.

            On a separate note, auto-failover of a node being added can be triggered during the bucket warmup phase, because heartbeats cannot be processed or sent in time. This behavior is very environment dependent (CPU utilization, slow memcached warmup, network delay, etc.) and is especially likely when the auto-failover timeout is set to 5 seconds, the lowest possible value.

            Please also retry with a higher auto-failover timeout, say 10 seconds.

            Abhijeeth.Nuthan Abhijeeth Nuthan added a comment
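
For reference, the retry suggested above (auto-failover timeout of 10 seconds) could be configured through the cluster REST endpoint `/settings/autoFailover` on port 8091. The following is only a minimal sketch: the helper name `build_autofailover_request`, the credentials, and the choice of target node are assumptions for illustration, not what perfrunner actually does. The 5-second lower bound is taken from the comment above.

```python
import base64
import urllib.parse
import urllib.request

# Lowest allowed timeout, as stated in the comment above (5 seconds).
MIN_TIMEOUT_SEC = 5


def build_autofailover_request(host, timeout_sec, user, password):
    """Build (but do not send) a POST to /settings/autoFailover.

    The function name and arguments are hypothetical; only the endpoint
    and the 'enabled'/'timeout' form fields come from the Couchbase REST API.
    """
    if timeout_sec < MIN_TIMEOUT_SEC:
        raise ValueError(f"timeout must be >= {MIN_TIMEOUT_SEC} seconds")

    # Form-encoded body: enabled=true&timeout=<seconds>
    body = urllib.parse.urlencode(
        {"enabled": "true", "timeout": timeout_sec}
    ).encode()

    req = urllib.request.Request(
        f"http://{host}:8091/settings/autoFailover",
        data=body,
        method="POST",
    )
    # HTTP basic auth with cluster admin credentials (placeholders here).
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req
```

To apply the setting, pass the returned request to `urllib.request.urlopen(...)` against any node in the cluster; building the request separately keeps the sketch safe to run without a live cluster.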

            Mahesh Mandhare: cbcollect_info appears to be timing out, so I do not have the orchestrator logs, which are required to debug this. Could you rerun with a larger timeout value (https://github.com/couchbase/perfrunner/blob/master/perfrunner/remote/linux.py#L145)?

            04:49:27 2019-08-07T04:49:27 [INFO] Reading configuration file: clusters/arke_themis_brqx.spec
            04:49:27 2019-08-07T04:49:27 [INFO] Detecting OS
            04:49:27 2019-08-07T04:49:27 [INFO] Running cbcollect_info with redaction
            04:49:27 2019-08-07T04:49:27 [INFO] Running cbcollect_info with redaction
            04:49:27 2019-08-07T04:49:27 [INFO] Running cbcollect_info with redaction
            04:49:27 2019-08-07T04:49:27 [INFO] Running cbcollect_info with redaction
            04:49:27 2019-08-07T04:49:27 [INFO] Running cbcollect_info with redaction
            04:49:27 2019-08-07T04:49:27 [INFO] Running cbcollect_info with redaction
            04:49:27 2019-08-07T04:49:27 [INFO] Running cbcollect_info with redaction
            04:49:27 2019-08-07T04:49:27 [INFO] Running cbcollect_info with redaction
            04:49:27 2019-08-07T04:49:27 [INFO] Running cbcollect_info with redaction
            04:49:27 2019-08-07T04:49:27 [INFO] Running cbcollect_info with redaction
            04:49:27 2019-08-07T04:49:27 [INFO] Running cbcollect_info with redaction
            05:09:27 2019-08-07T05:09:27 [ERROR] cbcollect_info timed out
            05:09:27 2019-08-07T05:09:27 [ERROR] cbcollect_info timed out
            05:09:27 2019-08-07T05:09:27 [ERROR] cbcollect_info timed out
            05:09:27 2019-08-07T05:09:27 [ERROR] cbcollect_info timed out
            05:09:28 + scripts/upload_info.sh
            05:09:33 https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-314/172.23.96.20.zip
            05:09:42 https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-314/172.23.96.23.zip
            05:09:49 https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-314/172.23.97.14.zip
            05:09:53 https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-314/172.23.97.15.zip
            05:09:57 https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-314/172.23.97.177.zip
            05:10:02 https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-314/172.23.97.19.zip
            05:10:06 https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-314/172.23.97.20.zip
            05:10:08 https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-arke-multi-bucket-314/tools.zip 

             

            Abhijeeth.Nuthan Abhijeeth Nuthan added a comment
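
The Jenkins log above can be summarized mechanically to see how long cbcollect_info ran before the timeout fired. This is a rough sketch that assumes the line layout shown in the log (Jenkins time, ISO timestamp, level, message); the function name `summarize` is hypothetical. The elapsed time is inferred from the timestamps only, not read from the perfrunner source.

```python
import re
from datetime import datetime

# Matches lines such as:
# "04:49:27 2019-08-07T04:49:27 [INFO] Running cbcollect_info with redaction"
LINE_RE = re.compile(
    r"^\S+ (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) \[(\w+)\] (.*)$"
)


def summarize(log_text):
    """Return (collections started, collections timed out, seconds elapsed).

    Seconds elapsed is measured from the first 'Running cbcollect_info'
    line to the first 'timed out' line, or None if either is missing.
    """
    starts, timeouts = [], []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # e.g. upload URLs or shell trace lines
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        message = match.group(3).strip()
        if message.startswith("Running cbcollect_info"):
            starts.append(stamp)
        elif message == "cbcollect_info timed out":
            timeouts.append(stamp)

    elapsed = None
    if starts and timeouts:
        elapsed = (min(timeouts) - min(starts)).total_seconds()
    return len(starts), len(timeouts), elapsed
```

Applied to the log in this comment, the sketch would count 11 collections started and 4 timed out, 1200 seconds (20 minutes) apart, which suggests the value to raise in the rerun.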

            People

              artem Artem Stemkovski
              mahesh.mandhare Mahesh Mandhare (Inactive)
