Couchbase Server / MB-51670

[System Test] Rebalance taking a long time in the Eventing phase


Details

    • Untriaged
    • 1
    • Yes

    Description

      Build : 7.1.0-2543
      Test : -test tests/integration/neo/test_neo_couchstore_milestone4.yml -scope tests/integration/neo/scope_couchstore.yml
      Iteration : 1st and 2nd
      Scale : 3

      In the first iteration, a rebalance was run to add back KV node 172.23.105.107 after a hard failover and full recovery. This rebalance took 12+ hrs to complete, as seen in the rebalance report (rebalance_report_20220403T014416.json, attached):

      "eventing" : {
               "completedTime" : "2022-04-02T18:44:16.254-07:00",
               "perNodeProgress" : {
                  "ns_1@172.23.104.67" : 1,
                  "ns_1@172.23.120.107" : 1,
                  "ns_1@172.23.96.192" : 1
               },
               "startTime" : "2022-04-02T06:36:55.090-07:00",
               "timeTaken" : 43641164,
               "totalProgress" : 100
            }
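
A quick way to sanity-check the numbers in this stage is to recompute the duration from the two timestamps. The sketch below is only illustrative: the stage is pasted in as a string rather than read from the attached report, `stage_duration_ms` is a made-up helper name, and it assumes `timeTaken` is reported in milliseconds (the report excerpt does not state the unit).

```python
import json
from datetime import datetime, timedelta

# Eventing stage copied from the rebalance report excerpt in this ticket.
EVENTING_STAGE = json.loads("""
{
  "completedTime": "2022-04-02T18:44:16.254-07:00",
  "perNodeProgress": {
    "ns_1@172.23.104.67": 1,
    "ns_1@172.23.120.107": 1,
    "ns_1@172.23.96.192": 1
  },
  "startTime": "2022-04-02T06:36:55.090-07:00",
  "timeTaken": 43641164,
  "totalProgress": 100
}
""")


def stage_duration_ms(stage):
    """Return completedTime minus startTime in whole milliseconds."""
    start = datetime.fromisoformat(stage["startTime"])
    end = datetime.fromisoformat(stage["completedTime"])
    return (end - start) // timedelta(milliseconds=1)


elapsed_ms = stage_duration_ms(EVENTING_STAGE)

# Assumption being checked: timeTaken is in milliseconds.
print("matches timeTaken:", elapsed_ms == EVENTING_STAGE["timeTaken"])
print("eventing stage duration:", timedelta(milliseconds=elapsed_ms))
```

Under that assumption the Eventing stage alone lasted roughly 12 hours 7 minutes, which is consistent with the 12+ hrs reported for the whole rebalance.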
      

      Logs covering this occurrence :

      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.137.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.155.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.157.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.5.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.67.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.69.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.70.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.105.107.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.105.111.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.105.168.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.106.100.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.106.188.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.108.103.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.120.107.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.120.245.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.121.117.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.123.28.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.96.148.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.96.192.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.96.251.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.96.252.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.96.253.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.97.119.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.97.121.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.97.122.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.97.239.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.99.20.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.99.21.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.99.25.zip
      

      In the 2nd iteration, a rebalance operation is currently in progress. Three nodes (kv, index, query) were auto-failed over simultaneously (multi-node failover) and are now being rebalanced back in. This rebalance has been running for 5+ hrs so far, and the Eventing phase is again taking a long time (4+ hrs at the time of writing).

      [2022-04-03T08:31:08-07:00, sequoiatools/cbutil:667091] /cbinit.py 172.23.106.100 root couchbase stop
      [2022-04-03T08:31:28-07:00, sequoiatools/cbutil:7144a4] /cbinit.py 172.23.123.28 root couchbase stop
      [2022-04-03T08:31:38-07:00, sequoiatools/cbutil:085d27] /cbinit.py 172.23.104.137 root couchbase stop
      [2022-04-03T08:31:44-07:00, sequoiatools/cmd:70243a] 10
      [2022-04-03T08:32:00-07:00, sequoiatools/couchbase-cli:7.1:d32689] rebalance -c 172.23.108.103:8091 -u Administrator -p password
      [2022-04-03T08:59:30-07:00, sequoiatools/cmd:d26d5a] 60
      [2022-04-03T09:00:36-07:00, sequoiatools/cmd:4c4e4c] 60
      [2022-04-03T09:01:42-07:00, sequoiatools/cbutil:6f46a5] /cbinit.py 172.23.106.100,172.23.123.28,172.23.104.137 root couchbase start
      [2022-04-03T09:01:49-07:00, sequoiatools/cmd:df307e] 120
      [2022-04-03T09:03:55-07:00, sequoiatools/couchbase-cli:7.1:3d1e9a] server-add -c 172.23.108.103:8091 --server-add https://172.23.106.100 -u Administrator -p password --server-add-username Administrator --server-add-password password --services data
      [2022-04-03T09:04:12-07:00, sequoiatools/couchbase-cli:7.1:ca0529] server-add -c 172.23.108.103:8091 --server-add https://172.23.123.28 -u Administrator -p password --server-add-username Administrator --server-add-password password --services index
      [2022-04-03T09:04:25-07:00, sequoiatools/couchbase-cli:7.1:be3732] server-add -c 172.23.108.103:8091 --server-add https://172.23.104.137 -u Administrator -p password --server-add-username Administrator --server-add-password password --services query
       
      Error occurred on container - sequoiatools/couchbase-cli:7.1:[server-add -c 172.23.108.103:8091 --server-add https://172.23.104.137 -u Administrator -p password --server-add-username Administrator --server-add-password password --services query]
       
      docker logs be3732
      docker start be3732
       
      =ERROR: Prepare join failed. Node is already part of cluster.
      [2022-04-03T09:04:32-07:00, sequoiatools/couchbase-cli:7.1:787fbc] rebalance -c 172.23.108.103:8091 -u Administrator -p password
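
To build a timeline from the test console log above, the step lines can be parsed mechanically. This is a rough sketch, not part of the test framework: the regular expression is inferred only from the lines pasted in this ticket (timestamp, image name, six-hex-digit container id, command), and `parse_step` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Pattern inferred from the pasted lines:
#   [<ISO timestamp>, <image>:<6 hex chars>] <command>
STEP_RE = re.compile(
    r"^\[(?P<ts>[^,\]]+), (?P<image>[^\]]+):(?P<cid>[0-9a-f]{6})\] (?P<cmd>.*)$"
)

# A few lines copied from the console log in this ticket.
LOG = [
    "[2022-04-03T08:31:08-07:00, sequoiatools/cbutil:667091] /cbinit.py 172.23.106.100 root couchbase stop",
    "[2022-04-03T08:32:00-07:00, sequoiatools/couchbase-cli:7.1:d32689] rebalance -c 172.23.108.103:8091 -u Administrator -p password",
    "=ERROR: Prepare join failed. Node is already part of cluster.",
    "[2022-04-03T09:04:32-07:00, sequoiatools/couchbase-cli:7.1:787fbc] rebalance -c 172.23.108.103:8091 -u Administrator -p password",
]


def parse_step(line):
    """Parse one step line; return None for lines that are not steps."""
    match = STEP_RE.match(line.strip())
    if match is None:
        return None
    return {
        "time": datetime.fromisoformat(match["ts"]),
        "image": match["image"],
        "container": match["cid"],
        "command": match["cmd"],
    }


steps = [step for step in map(parse_step, LOG) if step is not None]
rebalances = [s for s in steps if s["command"].startswith("rebalance ")]

# Time between the two rebalance commands issued by the test.
gap = rebalances[1]["time"] - rebalances[0]["time"]
print("rebalance commands issued", gap, "apart")
```

The second `rebalance` command (container 787fbc) is the last step in the pasted log, so its timestamp is a reasonable reference point when correlating with the collected logs.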
      

      The following logs were collected about 1 hr after the rebalance started. The Eventing nodes are 172.23.104.5, 172.23.104.67 and 172.23.96.192.

      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.104.137.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.104.155.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.104.5.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.104.67.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.104.69.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.104.70.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.105.107.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.105.111.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.105.168.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.106.100.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.106.188.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.108.103.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.120.107.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.120.245.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.121.117.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.123.28.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.96.148.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.96.192.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.96.251.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.96.252.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.96.253.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.97.119.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.97.121.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.97.122.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.99.11.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.99.20.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.99.25.zip
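
Since the two log collections do not cover exactly the same nodes, it can help to pull the node addresses out of the URLs and compare them. The following is a minimal sketch under assumptions: the file-name pattern is inferred from the URLs in this ticket, `parse_collectinfo_url` and `nodes_in` are illustrative names, and only a handful of URLs from each list are included to keep it short.

```python
import re
from urllib.parse import unquote, urlparse

# File-name pattern inferred from the URLs in this ticket:
#   collectinfo-<YYYY-MM-DD>T<HHMMSS>-ns_1@<node address>.zip
NAME_RE = re.compile(
    r"collectinfo-(?P<stamp>\d{4}-\d{2}-\d{2}T\d{6})-ns_1@(?P<ip>[\d.]+)\.zip$"
)

BASE = "https://cb-jira.s3.us-east-2.amazonaws.com/logs/"

# Small subsets of the two URL lists above (not the full lists).
FIRST = [
    BASE + "systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.137.zip",
    BASE + "systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.99.20.zip",
    BASE + "systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.99.21.zip",
]
SECOND = [
    BASE + "systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.104.137.zip",
    BASE + "systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.99.11.zip",
    BASE + "systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.99.20.zip",
]


def parse_collectinfo_url(url):
    """Return (collection timestamp, node address) for one collectinfo URL."""
    path = unquote(urlparse(url).path)  # turns %40 back into '@'
    match = NAME_RE.search(path)
    if match is None:
        raise ValueError(f"not a collectinfo URL: {url}")
    return match.group("stamp"), match.group("ip")


def nodes_in(urls):
    """Set of node addresses covered by a list of collectinfo URLs."""
    return {parse_collectinfo_url(url)[1] for url in urls}


only_first = sorted(nodes_in(FIRST) - nodes_in(SECOND))
only_second = sorted(nodes_in(SECOND) - nodes_in(FIRST))
print("only in first collection:", only_first)
print("only in second collection:", only_second)
```

Feeding in the complete lists instead of the subsets gives the full set of nodes that appear in one collection but not the other.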

      This is a regression since RC3: the issue was not seen in any earlier build in the longevity test.

          People

            sujay.gad Sujay Gad
            mihir.kamdar Mihir Kamdar (Inactive)