Couchbase Server / MB-58284

Indexer rebalance stuck for more than 17 hours

    Description

      Test that led to the failure:

      1. Create a 7-node cluster (c5.2xlarge): 3-KV, 2-GSI & 2-N1QL.
      2. Create a bucket with 10 collections and 100M items in each.
      3. Create GSI indexes on 2 collections.
      4. Start a KV read + expiry load at 10k ops/s (9k reads, 1k expiry). Start the N1QL query load in parallel.
      5. Scale up the cluster to 4-KV, 3-GSI & 3-N1QL. Wait for rebalance to finish.
      6. Turn the cluster off and back on.
      7. Scale up the cluster to 5-KV, 4-GSI & 4-N1QL. Wait for rebalance to finish.
      8. Turn the cluster off and back on.
      9. Scale down the cluster to 4-KV, 3-GSI & 3-N1QL. Wait for rebalance to finish.
      10. Turn the cluster off and back on.
      11. Scale down the cluster to 3-KV, 2-GSI & 2-N1QL. Wait for rebalance to finish.
      12. Turn the cluster off and back on.
      13. Do an EBS volume scale-up. Wait for rebalance to finish.
      14. Turn the cluster off and back on.
      15. Do an EBS volume scale-down.
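
Several steps above say "Wait for rebalance to finish". A bounded wait makes a hang like this one fail the test instead of running for hours. The following is a minimal sketch, not the test framework's actual code: `fetch_rebalance_status` and `wait_for_rebalance` are hypothetical helper names, and it assumes the cluster manager's REST endpoint `GET /pools/default/rebalanceProgress` is reachable with basic auth and reports `"status": "none"` when no rebalance is running.

```python
import base64
import json
import time
import urllib.request


def fetch_rebalance_status(base_url, user, password):
    """Return the "status" field from GET /pools/default/rebalanceProgress.

    Assumption: base_url looks like "http://<host>:8091" and the
    credentials are valid for the cluster manager REST API.
    """
    req = urllib.request.Request(base_url + "/pools/default/rebalanceProgress")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp).get("status")


def wait_for_rebalance(get_status, timeout_s, poll_s=10,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "none".

    Returns True if the rebalance stopped within timeout_s seconds,
    False if the deadline passed while it was still reported as running.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if get_status() == "none":
            return True
        sleep(poll_s)
    return False
```

A caller would pass something like `lambda: fetch_rebalance_status(url, user, password)` as `get_status`. Note that a status of `"none"` only says that no rebalance is running; it does not by itself distinguish a successful rebalance from one that exited with an error, so the logs still need to be checked.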

      Rebalance seems to be failing because a node is getting failed over many times.

      2023-08-15T19:52:21.953Z, ns_orchestrator:0:critical:message(ns_1@svc-d-node-013.qgsopockw4jhf3qd.sandbox.nonprod-project-avengers.com) - Rebalance exited with reason {service_rebalance_failed,index, {agent_died,<34620.2990.0>,noconnection}}.Rebalance Operation Id = b3707548ffa4d611203aed50aeb1510c
      2023-08-15T19:52:22.020Z, failover:0:info:message(ns_1@svc-d-node-013.qgsopockw4jhf3qd.sandbox.nonprod-project-avengers.com) - Starting failing over ['ns_1@svc-i-node-016.qgsopockw4jhf3qd.sandbox.nonprod-project-avengers.com']
      2023-08-15T19:52:22.020Z, ns_orchestrator:0:info:message(ns_1@svc-d-node-013.qgsopockw4jhf3qd.sandbox.nonprod-project-avengers.com) - Starting failover of nodes ['ns_1@svc-i-node-016.qgsopockw4jhf3qd.sandbox.nonprod-project-avengers.com']. Operation Id = 3b459ed48506c12925d1739dcd7afcce
      2023-08-15T19:52:22.139Z, failover:0:info:message(ns_1@svc-d-node-013.qgsopockw4jhf3qd.sandbox.nonprod-project-avengers.com) - Failed over ['ns_1@svc-i-node-016.qgsopockw4jhf3qd.sandbox.nonprod-project-avengers.com']: ok
      2023-08-15T19:52:24.146Z, failover:0:info:message(ns_1@svc-d-node-013.qgsopockw4jhf3qd.sandbox.nonprod-project-avengers.com) - Deactivating failed over nodes ['ns_1@svc-i-node-016.qgsopockw4jhf3qd.sandbox.nonprod-project-avengers.com']
      2023-08-15T19:52:24.294Z, ns_orchestrator:0:info:message(ns_1@svc-d-node-013.qgsopockw4jhf3qd.sandbox.nonprod-project-avengers.com) - Failover completed successfully.Rebalance Operation Id = 3b459ed48506c12925d1739dcd7afcce
      2023-08-15T19:52:24.355Z, auto_failover:0:info:message(ns_1@svc-d-node-013.qgsopockw4jhf3qd.sandbox.nonprod-project-avengers.com) - Node ('ns_1@svc-i-node-016.qgsopockw4jhf3qd.sandbox.nonprod-project-avengers.com') was automatically failed over. Reason: The cluster manager did not respond for the duration of the auto-failover threshold.
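
To count how often a node was failed over, log text in this format can be split at each timestamp and filtered by module, even when it has been pasted with no line breaks between entries. This is a rough sketch for triage only: the regular expressions are inferred from the excerpt above rather than from any documented log format, and `split_entries`, `parse_entry`, `auto_failovers` and the shortened node names in `SAMPLE` are hypothetical.

```python
import re

# An entry is assumed to start at a UTC timestamp such as
# 2023-08-15T19:52:21.953Z followed by ", ".
ENTRY_START = re.compile(r"(?=\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z, )")

# <timestamp>, <module>:<code>:<level>:message(<reporting node>) - <text>
ENTRY = re.compile(
    r"(?P<ts>\S+), (?P<module>\w+):\d+:(?P<level>\w+):"
    r"message\((?P<node>[^)]+)\) - (?P<text>.*)",
    re.S,
)


def split_entries(blob):
    """Split text into one string per log entry, fused or not."""
    return [part.strip() for part in ENTRY_START.split(blob) if part.strip()]


def parse_entry(entry):
    """Return a dict of fields for one entry, or None if it does not match."""
    match = ENTRY.match(entry)
    return match.groupdict() if match else None


def auto_failovers(blob):
    """Return parsed entries that were logged by the auto_failover module."""
    parsed = (parse_entry(e) for e in split_entries(blob))
    return [p for p in parsed if p and p["module"] == "auto_failover"]


# Hypothetical sample with shortened node names, fused the way a paste can be.
SAMPLE = (
    "2023-08-15T19:52:22.139Z, failover:0:info:message(ns_1@node-a) - "
    "Failed over ['ns_1@node-b']: ok"
    "2023-08-15T19:52:24.355Z, auto_failover:0:info:message(ns_1@node-a) - "
    "Node ('ns_1@node-b') was automatically failed over."
)

for event in auto_failovers(SAMPLE):
    print(event["ts"], event["text"])
```

Running the same filter over the full log would show whether the same index node is auto-failed-over before each rebalance exit.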

      Not sure if this is a duplicate of/related to MB-57814

          People

            pavan.pb Pavan PB
            mohsin.ahmed Mohsin Ahmed