  1. Couchbase Server
  2. MB-45769

Rebalance repeatedly fails during upgrade with Rebalance exited with reason {pre_rebalance_janitor_run_failed,"DISTRICT", {error, {config_sync_failed,push,


Details

    • Untriaged
    • 1
    • No

    Description

      Steps to Repro
      It is essentially an upgrade of the system test cluster.

      1. Start a 6.6.2 system test longevity run.
      2. It has the following cluster setup:

      • 9 data nodes
      • 3 analytics nodes
      • 3 eventing nodes
      • 4 indexing nodes
      • 3 search nodes
      • 3 query nodes

      3. It has 10 buckets, FTS indexes, analytics datasets, 2i indexes, and eventing functions.
      4. We do a swap rebalance of 6 nodes (1 data, 1 index, 1 analytics, 1 fts, 1 query, 1 eventing), replacing 6.6.2-9588 with 7.0.0-4979. This works fine.
      5. Fail over one fts node on 6.6.2-9588 - 172.23.106.207
      6. Fail over one n1ql node on 6.6.2-9588 - 172.23.106.191
      7. Now try to gracefully fail over one 6.6.2-9588 node - 172.23.105.90
      8. At this point I hit MB-45767.
      9. To proceed with the upgrade of the cluster, I do a multi-node hard failover of the following nodes:

      172.23.105.90
      172.23.105.62
      172.23.105.118
      172.23.105.25
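
For reference, the multi-node hard failover in step 9 could be issued as a single REST request. This is a minimal sketch, not the exact call used in the test: the cluster address (taken from the node that reports the error later in this ticket), the credentials, and the helper names `failover_args` and `hard_failover_cmd` are all assumptions, and it relies on the `/controller/failOver` endpoint accepting one `otpNode` field per node. The helper only prints the command so it can be reviewed first.

```shell
# Sketch only: CLUSTER, CB_USER and CB_PASS are assumed values, not from the report.
CLUSTER="${CLUSTER:-172.23.104.244:8091}"
CB_USER="${CB_USER:-Administrator}"
CB_PASS="${CB_PASS:-password}"
FAILOVER_NODES="172.23.105.90 172.23.105.62 172.23.105.118 172.23.105.25"

# Emit one otpNode form field per node so all nodes go into a single request.
failover_args() {
    for ip in $FAILOVER_NODES; do
        printf ' -d otpNode=ns_1@%s' "$ip"
    done
}

# Print the curl command instead of running it (hypothetical helper).
hard_failover_cmd() {
    echo "curl -s -u $CB_USER:$CB_PASS -X POST http://$CLUSTER/controller/failOver$(failover_args)"
}
```

Running `hard_failover_cmd` prints the request; piping its output to `sh` would execute it.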
      

      10. Run the following commands on all six nodes (172.23.105.90,172.23.105.62,172.23.105.118,172.23.105.25,172.23.106.207,172.23.106.191).

      systemctl stop couchbase-server
      rpm -U http://172.23.126.166/builds/latestbuilds/couchbase-server/cheshire-cat/4979/couchbase-server-enterprise-7.0.0-4979-centos7.x86_64.rpm
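
The two commands above have to be repeated on six nodes, so a small loop may be convenient. The following is a sketch under stated assumptions: root SSH access to each node and the `DRY_RUN` switch are not part of the original report, and `upgrade_node` / `upgrade_all` are hypothetical helper names. By default it only prints what it would run.

```shell
# Sketch only: root SSH access and the DRY_RUN switch are assumptions.
NODES="172.23.105.90 172.23.105.62 172.23.105.118 172.23.105.25 172.23.106.207 172.23.106.191"
RPM_URL="http://172.23.126.166/builds/latestbuilds/couchbase-server/cheshire-cat/4979/couchbase-server-enterprise-7.0.0-4979-centos7.x86_64.rpm"
DRY_RUN="${DRY_RUN:-1}"

# Stop the service and upgrade the package on one node.
upgrade_node() {
    node="$1"
    remote_cmd="systemctl stop couchbase-server && rpm -U $RPM_URL"
    if [ "$DRY_RUN" = "1" ]; then
        # Dry run: show the command without touching the node.
        echo "ssh root@$node '$remote_cmd'"
    else
        ssh "root@$node" "$remote_cmd"
    fi
}

# Walk the node list and stop at the first failure.
upgrade_all() {
    for node in $NODES; do
        upgrade_node "$node" || { echo "upgrade failed on $node" >&2; return 1; }
    done
}
```

Calling `upgrade_all` prints the six SSH commands; `DRY_RUN=0 upgrade_all` would execute them.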
      

      Next, I recover all the nodes and rebalance. Rebalance works for every node except 172.23.105.90, which is a kv node. I retried the rebalance multiple times, hoping to continue upgrading the cluster, but every attempt failed with the following error. See rebalanceReport (1).json.

      172.23.104.244 - 8:32:33 AM 19 Apr, 2021

      Rebalance exited with reason {pre_rebalance_janitor_run_failed,"DISTRICT",
      {error,
      {config_sync_failed,push,
      {error,[{'ns_1@172.23.106.225',timeout}]}}}}.
      Rebalance Operation Id = 3c8c387d7a88daf60ffe335be82d46c4
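
When the rebalance is retried several times, it can help to pull the node named in the `config_sync_failed` term out of each failure message. A minimal sketch, assuming the message is available as text on standard input; `timed_out_nodes` is a hypothetical helper, not part of any Couchbase tool.

```shell
# Print every node that the error term reports with a timeout, one per line.
# Reads the rebalance failure message from standard input.
timed_out_nodes() {
    grep -o "{'[^']*',timeout}" | sed "s/^{'//; s/',timeout}\$//"
}
```

For the error above, `printf '%s\n' "{error,[{'ns_1@172.23.106.225',timeout}]}}}}." | timed_out_nodes` prints `ns_1@172.23.106.225`.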
      

      It would be good to have a workaround for this so that I can continue to upgrade the cluster.
      cbcollect_info attached. See also MB-45646 and MB-45767.


            People

              artem Artem Stemkovski
              Balakumaran.Gopal Balakumaran Gopal