Couchbase Server / MB-51202

MultiNodeFailover: Rebalance out of failed nodes operation stuck

Details

    Description

       

      Build: 7.1.0-2335

      Scenario:

      • 9-node cluster

        +----------------+----------+
        | Node           | Services |
        +----------------+----------+
        | 172.23.105.212 | kv       |
        | 172.23.136.106 | n1ql     |
        | 172.23.136.107 | n1ql     |
        | 172.23.136.104 | index    |
        | 172.23.136.108 | n1ql     |
        | 172.23.105.155 | kv       |
        | 172.23.105.213 | index    |
        | 172.23.136.105 | index    |
        | 172.23.105.211 | kv       |
        +----------------+----------+

      • 3 buckets with replica=1 (2 couchbase and 1 ephemeral)

        +---------+-----------+-----------------+----------+--------+------------+------------+------------+---------+
        | Bucket  | Type      | Storage Backend | Replicas | Items  | RAM Quota  | RAM Used   | Disk Used  | ARR     |
        +---------+-----------+-----------------+----------+--------+------------+------------+------------+---------+
        | bucket1 | couchbase | couchstore      | 1        | 15300  | 300.00 MiB | 65.91 MiB  | 37.92 MiB  | 100     |
        | bucket2 | ephemeral | -               | 1        | 15300  | 300.00 MiB | 44.08 MiB  | 102.0 Byte | -       |
        | default | couchbase | couchstore      | 1        | 500000 | 300.00 MiB | 207.79 MiB | 335.44 MiB | 40.4964 |
        +---------+-----------+-----------------+----------+--------+------------+------------+------------+---------+

      • Auto-failover enabled with max_count=4, timeout=30 (seconds)
      • Fail over 4 nodes using the stop_couchbase action

        +----------------+----------+----------------+----------------+
        | Node           | Services | Node status    | Failover type  |
        +----------------+----------+----------------+----------------+
        | 172.23.136.106 | n1ql     | inactiveFailed | stop_couchbase |
        | 172.23.105.155 | kv       | inactiveFailed | stop_couchbase |
        | 172.23.105.213 | index    | inactiveFailed | stop_couchbase |
        | 172.23.136.107 | n1ql     | inactiveFailed | stop_couchbase |
        +----------------+----------+----------------+----------------+

      • Failover triggered as expected
      • Load more data into the existing buckets and perform collection add/drop operations
      • Rebalance out all failed-over nodes from the cluster
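
The scenario steps above could be scripted roughly as follows. This is a minimal sketch, not the test framework's actual code: it only builds the form payloads that would be sent to Couchbase's `POST /settings/autoFailover` and `POST /controller/rebalance` REST endpoints, and sends nothing. The helper names (`otp_name`, `auto_failover_payload`, `rebalance_out_payload`) are hypothetical; the node addresses are copied from the tables in this report.

```python
from urllib.parse import parse_qs, urlencode

# Node addresses copied from the cluster and failover tables in this report.
ALL_NODES = [
    "172.23.105.212", "172.23.136.106", "172.23.136.107",
    "172.23.136.104", "172.23.136.108", "172.23.105.155",
    "172.23.105.213", "172.23.136.105", "172.23.105.211",
]
FAILED_NODES = [
    "172.23.136.106", "172.23.105.155",
    "172.23.105.213", "172.23.136.107",
]


def otp_name(ip):
    """Return the node name in the form used by the report, e.g. ns_1@172.23.105.211."""
    return f"ns_1@{ip}"


def auto_failover_payload(max_count=4, timeout=30):
    """Form body for POST /settings/autoFailover (timeout is in seconds)."""
    return urlencode({"enabled": "true", "timeout": timeout, "maxCount": max_count})


def rebalance_out_payload(all_nodes, ejected_nodes):
    """Form body for POST /controller/rebalance that ejects the given nodes."""
    return urlencode({
        "knownNodes": ",".join(otp_name(n) for n in all_nodes),
        "ejectedNodes": ",".join(otp_name(n) for n in ejected_nodes),
    })


# Decode the payload again to show which nodes would be ejected.
body = rebalance_out_payload(ALL_NODES, FAILED_NODES)
print(parse_qs(body)["ejectedNodes"][0])
```

Each payload would be posted to the cluster's admin port on the orchestrator node with administrator credentials; those details are left out here because the report does not state them.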

      Observation:

      Data rebalance completed for bucket1 and bucket2 but is stuck for the 'default' bucket.
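
One way to detect this kind of hang automatically is to sample the rebalance progress and flag it when the value stops advancing. The sketch below is an assumption about how such a check might look, not what the test suite does. It assumes progress samples are taken from the task list returned by `GET /pools/default/tasks`, and that the rebalance task carries `type`, `status`, and `progress` fields; the function names, window size, and threshold are hypothetical.

```python
def rebalance_progress(tasks):
    """Return the progress of the running rebalance task, or None if there is none.

    `tasks` is assumed to be the decoded JSON list from /pools/default/tasks.
    """
    for task in tasks:
        if task.get("type") == "rebalance" and task.get("status") == "running":
            return float(task.get("progress", 0.0))
    return None


def is_stuck(progress_samples, window=5, min_delta=0.01):
    """Return True if progress advanced by less than min_delta over the last `window` samples."""
    if len(progress_samples) < window:
        return False  # not enough history to decide
    recent = progress_samples[-window:]
    return (recent[-1] - recent[0]) < min_delta
```

With samples taken at a fixed interval, `window` times that interval is the time the rebalance is allowed to sit at the same progress value before it is reported as stuck.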

      Rebalance report:

      {
        "stageInfo": {
          "failover": {
            "totalProgress": 100,
            "perNodeProgress": {
              "ns_1@172.23.105.213": 1,
              "ns_1@172.23.105.155": 1,
              "ns_1@172.23.136.106": 1,
              "ns_1@172.23.136.107": 1
            },
            "startTime": "2022-02-24T22:35:09.465-08:00",
            "completedTime": "2022-02-24T22:35:10.742-08:00",
            "timeTaken": 1278,
            "subStages": {
              "default": {
                "totalProgress": 100,
                "perNodeProgress": {
                  "ns_1@172.23.105.155": 1
                },
                "startTime": "2022-02-24T22:35:10.031-08:00",
                "completedTime": "2022-02-24T22:35:10.341-08:00",
                "timeTaken": 309
              },
              "bucket2": {
                "totalProgress": 100,
                "perNodeProgress": {
                  "ns_1@172.23.105.155": 1
                },
                "startTime": "2022-02-24T22:35:09.770-08:00",
                "completedTime": "2022-02-24T22:35:10.031-08:00",
                "timeTaken": 261
              },
              "bucket1": {
                "totalProgress": 100,
                "perNodeProgress": {
                  "ns_1@172.23.105.155": 1
                },
                "startTime": "2022-02-24T22:35:09.474-08:00",
                "completedTime": "2022-02-24T22:35:09.770-08:00",
                "timeTaken": 296
              }
            }
          }
        },
        "rebalanceId": "489f3556cab111acc4151a7019f282b5",
        "nodesInfo": {
          "active_nodes": [
            "ns_1@172.23.105.155",
            "ns_1@172.23.105.211",
            "ns_1@172.23.105.212",
            "ns_1@172.23.105.213",
            "ns_1@172.23.136.104",
            "ns_1@172.23.136.105",
            "ns_1@172.23.136.106",
            "ns_1@172.23.136.107",
            "ns_1@172.23.136.108"
          ],
          "failover_nodes": [
            "ns_1@172.23.105.155",
            "ns_1@172.23.105.213",
            "ns_1@172.23.136.106",
            "ns_1@172.23.136.107"
          ],
          "master_node": "ns_1@172.23.105.211"
        },
        "masterNode": "ns_1@172.23.105.211",
        "startTime": "2022-02-24T22:35:09.456-08:00",
        "completedTime": "2022-02-24T22:35:10.772-08:00",
        "timeTaken": 1316,
        "completionMessage": "Failover completed successfully."
      }
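
A report like the one above can be summarized with a few lines of Python. The sketch below uses a trimmed copy of the report (only the fields it reads) and hypothetical helper names. It lists the stages recorded, the per-bucket times, and the nodes that should remain once the failed-over nodes are ejected. Note that the report as pasted contains only a `failover` stage and the completion message "Failover completed successfully.", so it appears to describe the failover rather than the stuck rebalance.

```python
# Trimmed copy of the report above: only the fields read below are kept.
REPORT = {
    "stageInfo": {
        "failover": {
            "timeTaken": 1278,
            "subStages": {
                "default": {"timeTaken": 309},
                "bucket2": {"timeTaken": 261},
                "bucket1": {"timeTaken": 296},
            },
        }
    },
    "nodesInfo": {
        "active_nodes": [
            "ns_1@172.23.105.155", "ns_1@172.23.105.211", "ns_1@172.23.105.212",
            "ns_1@172.23.105.213", "ns_1@172.23.136.104", "ns_1@172.23.136.105",
            "ns_1@172.23.136.106", "ns_1@172.23.136.107", "ns_1@172.23.136.108",
        ],
        "failover_nodes": [
            "ns_1@172.23.105.155", "ns_1@172.23.105.213",
            "ns_1@172.23.136.106", "ns_1@172.23.136.107",
        ],
    },
    "completionMessage": "Failover completed successfully.",
}


def stages(report):
    """Names of the stages recorded in the report."""
    return sorted(report["stageInfo"])


def substage_times(report, stage):
    """Map each bucket sub-stage of `stage` to its timeTaken value."""
    sub = report["stageInfo"][stage].get("subStages", {})
    return {bucket: info["timeTaken"] for bucket, info in sub.items()}


def remaining_nodes(report):
    """Active nodes that are not in the failover list."""
    info = report["nodesInfo"]
    return sorted(set(info["active_nodes"]) - set(info["failover_nodes"]))


print(stages(REPORT))
print(substage_times(REPORT, "failover"))
print(remaining_nodes(REPORT))
```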

       

       

          People

            ashwin.govindarajulu Ashwin Govindarajulu