Couchbase Server / MB-51670

[System Test] Rebalance taking a long time in the Eventing phase


Details

    • Untriaged
    • 1
    • Yes

    Description

      Build : 7.1.0-2543
      Test : -test tests/integration/neo/test_neo_couchstore_milestone4.yml -scope tests/integration/neo/scope_couchstore.yml
      Iteration : 1st and 2nd
      Scale : 3

      In the first iteration, a rebalance was run as part of a hard failover, full recovery and add-back of KV node 172.23.105.107. This rebalance took 12+ hrs to complete. The Eventing phase alone accounted for about 12 h 7 min of this (timeTaken 43641164 ms), as seen in the rebalance report (rebalance_report_20220403T014416.json, attached):

      "eventing" : {
         "completedTime" : "2022-04-02T18:44:16.254-07:00",
         "perNodeProgress" : {
            "ns_1@172.23.104.67" : 1,
            "ns_1@172.23.120.107" : 1,
            "ns_1@172.23.96.192" : 1
         },
         "startTime" : "2022-04-02T06:36:55.090-07:00",
         "timeTaken" : 43641164,
         "totalProgress" : 100
      }
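      The report does not state the unit of "timeTaken". Assuming it is milliseconds, it can be cross-checked against "startTime" and "completedTime" with a short Python sketch (illustrative only, not part of the test tooling):

```python
from datetime import datetime, timedelta

# Values copied from the "eventing" section of the rebalance report above.
start = datetime.fromisoformat("2022-04-02T06:36:55.090-07:00")
completed = datetime.fromisoformat("2022-04-02T18:44:16.254-07:00")
time_taken_ms = 43641164  # assumption: "timeTaken" is reported in milliseconds

# Dividing one timedelta by another gives an exact whole number of milliseconds.
elapsed_ms = (completed - start) // timedelta(milliseconds=1)

print(elapsed_ms == time_taken_ms)               # True
print(timedelta(milliseconds=time_taken_ms))     # 12:07:21.164000
print(f"{time_taken_ms / 3_600_000:.2f} hours")  # 12.12 hours
```

      The exact match supports reading "timeTaken" as milliseconds, i.e. the Eventing phase ran for just over 12 hours.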
      

      Logs covering this occurrence:

      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.137.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.155.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.157.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.5.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.67.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.69.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.104.70.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.105.107.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.105.111.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.105.168.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.106.100.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.106.188.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.108.103.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.120.107.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.120.245.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.121.117.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.123.28.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.96.148.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.96.192.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.96.251.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.96.252.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.96.253.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.97.119.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.97.121.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.97.122.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.97.239.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.99.20.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.99.21.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1648957940/collectinfo-2022-04-03T035222-ns_1%40172.23.99.25.zip
      

      In the 2nd iteration, a rebalance operation is currently in progress. Three nodes (KV, index and query) were auto-failed over simultaneously (multi-node failover) and are now being rebalanced back in. This rebalance has been running for 5+ hrs, and the Eventing phase is again taking a long time (4+ hrs so far).

      [2022-04-03T08:31:08-07:00, sequoiatools/cbutil:667091] /cbinit.py 172.23.106.100 root couchbase stop
      [2022-04-03T08:31:28-07:00, sequoiatools/cbutil:7144a4] /cbinit.py 172.23.123.28 root couchbase stop
      [2022-04-03T08:31:38-07:00, sequoiatools/cbutil:085d27] /cbinit.py 172.23.104.137 root couchbase stop
      [2022-04-03T08:31:44-07:00, sequoiatools/cmd:70243a] 10
      [2022-04-03T08:32:00-07:00, sequoiatools/couchbase-cli:7.1:d32689] rebalance -c 172.23.108.103:8091 -u Administrator -p password
      [2022-04-03T08:59:30-07:00, sequoiatools/cmd:d26d5a] 60
      [2022-04-03T09:00:36-07:00, sequoiatools/cmd:4c4e4c] 60
      [2022-04-03T09:01:42-07:00, sequoiatools/cbutil:6f46a5] /cbinit.py 172.23.106.100,172.23.123.28,172.23.104.137 root couchbase start
      [2022-04-03T09:01:49-07:00, sequoiatools/cmd:df307e] 120
      [2022-04-03T09:03:55-07:00, sequoiatools/couchbase-cli:7.1:3d1e9a] server-add -c 172.23.108.103:8091 --server-add https://172.23.106.100 -u Administrator -p password --server-add-username Administrator --server-add-password password --services data
      [2022-04-03T09:04:12-07:00, sequoiatools/couchbase-cli:7.1:ca0529] server-add -c 172.23.108.103:8091 --server-add https://172.23.123.28 -u Administrator -p password --server-add-username Administrator --server-add-password password --services index
      [2022-04-03T09:04:25-07:00, sequoiatools/couchbase-cli:7.1:be3732] server-add -c 172.23.108.103:8091 --server-add https://172.23.104.137 -u Administrator -p password --server-add-username Administrator --server-add-password password --services query
       
      Error occurred on container - sequoiatools/couchbase-cli:7.1:[server-add -c 172.23.108.103:8091 --server-add https://172.23.104.137 -u Administrator -p password --server-add-username Administrator --server-add-password password --services query]
       
      docker logs be3732
      docker start be3732
       
      =ERROR: Prepare join failed. Node is already part of cluster.
      [2022-04-03T09:04:32-07:00, sequoiatools/couchbase-cli:7.1:787fbc] rebalance -c 172.23.108.103:8091 -u Administrator -p password
      

      The following logs were collected about 1 hr after the rebalance started. The Eventing nodes are 172.23.104.5, 172.23.104.67 and 172.23.96.192.

      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.104.137.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.104.155.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.104.5.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.104.67.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.104.69.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.104.70.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.105.107.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.105.111.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.105.168.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.106.100.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.106.188.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.108.103.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.120.107.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.120.245.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.121.117.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.123.28.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.96.148.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.96.192.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.96.251.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.96.252.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.96.253.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.97.119.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.97.121.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.97.122.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.99.11.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.99.20.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1649006126/collectinfo-2022-04-03T171529-ns_1%40172.23.99.25.zip

      This is a regression since RC3, as this issue was never seen with any earlier build in the longevity test.


          People

            Sujay Gad (sujay.gad)
            Mihir Kamdar (mihir.kamdar, inactive)