Couchbase Server / MB-51202

MultiNodeFailover: Rebalance out of failed nodes operation stuck


Details

    Description

       

      Build: 7.1.0-2335

      Scenario:

      • 9 node cluster

        +----------------+----------+
        | 172.23.105.212 | kv       |
        | 172.23.136.106 | n1ql     |
        | 172.23.136.107 | n1ql     |
        | 172.23.136.104 | index    |
        | 172.23.136.108 | n1ql     |
        | 172.23.105.155 | kv       |
        | 172.23.105.213 | index    |
        | 172.23.136.105 | index    |
        | 172.23.105.211 | kv       |
        +----------------+----------+

      • 3 buckets with replica=1 (2 Couchbase and 1 ephemeral)

        +---------+-----------+-----------------+----------+--------+------------+------------+------------+---------+
        | Bucket  | Type      | Storage Backend | Replicas | Items  | RAM Quota  | RAM Used   | Disk Used  | ARR     |
        +---------+-----------+-----------------+----------+--------+------------+------------+------------+---------+
        | bucket1 | couchbase | couchstore      | 1        | 15300  | 300.00 MiB | 65.91 MiB  | 37.92 MiB  | 100     |
        | bucket2 | ephemeral | -               | 1        | 15300  | 300.00 MiB | 44.08 MiB  | 102.0 Byte | -       |
        | default | couchbase | couchstore      | 1        | 500000 | 300.00 MiB | 207.79 MiB | 335.44 MiB | 40.4964 |
        +---------+-----------+-----------------+----------+--------+------------+------------+------------+---------+

      • Auto-failover enabled with max_count=4, timeout=30
      • Fail over 4 nodes using the stop_couchbase action

        +----------------+----------+----------------+----------------+
        | Node           | Services | Node status    | Failover type  |
        +----------------+----------+----------------+----------------+
        | 172.23.136.106 | n1ql     | inactiveFailed | stop_couchbase |
        | 172.23.105.155 | kv       | inactiveFailed | stop_couchbase |
        | 172.23.105.213 | index    | inactiveFailed | stop_couchbase |
        | 172.23.136.107 | n1ql     | inactiveFailed | stop_couchbase |
        +----------------+----------+----------------+----------------+

      • Auto-failover triggered as expected
      • Load more data into the existing buckets and perform collection add/drop operations
      • Rebalance out all failed-over nodes from the cluster
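      For reference, the auto-failover and rebalance-out steps above map onto the ns_server REST API (POST /settings/autoFailover and POST /controller/rebalance). The sketch below only builds the form-encoded request bodies from the scenario's values; it does not contact a cluster, and the node lists are copied from the report.

      ```python
      # Minimal sketch of the request bodies behind the scenario steps above.
      # Assumed endpoints: POST /settings/autoFailover, POST /controller/rebalance.
      # Nothing here talks to a live cluster.
      from urllib.parse import urlencode

      def autofailover_body(max_count: int, timeout: int) -> str:
          """Form body enabling auto-failover with the scenario's settings."""
          return urlencode({"enabled": "true", "timeout": timeout, "maxCount": max_count})

      def rebalance_out_body(known, ejected) -> str:
          """Form body for rebalancing the failed-over nodes out of the cluster."""
          return urlencode({"knownNodes": ",".join(known), "ejectedNodes": ",".join(ejected)})

      failed = ["ns_1@172.23.136.106", "ns_1@172.23.105.155",
                "ns_1@172.23.105.213", "ns_1@172.23.136.107"]
      alive = ["ns_1@172.23.105.211", "ns_1@172.23.105.212"]  # subset, for illustration

      print(autofailover_body(4, 30))  # enabled=true&timeout=30&maxCount=4
      print(rebalance_out_body(alive + failed, failed))
      ```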

      Observation:

      Data rebalance completed for buckets bucket1 and bucket2, but is stuck for the 'default' bucket.

      Rebalance report:

      {
        "stageInfo": {
          "failover": {
            "totalProgress": 100,
            "perNodeProgress": {
              "ns_1@172.23.105.213": 1,
              "ns_1@172.23.105.155": 1,
              "ns_1@172.23.136.106": 1,
              "ns_1@172.23.136.107": 1
            },
            "startTime": "2022-02-24T22:35:09.465-08:00",
            "completedTime": "2022-02-24T22:35:10.742-08:00",
            "timeTaken": 1278,
            "subStages": {
              "default": {
                "totalProgress": 100,
                "perNodeProgress": {
                  "ns_1@172.23.105.155": 1
                },
                "startTime": "2022-02-24T22:35:10.031-08:00",
                "completedTime": "2022-02-24T22:35:10.341-08:00",
                "timeTaken": 309
              },
              "bucket2": {
                "totalProgress": 100,
                "perNodeProgress": {
                  "ns_1@172.23.105.155": 1
                },
                "startTime": "2022-02-24T22:35:09.770-08:00",
                "completedTime": "2022-02-24T22:35:10.031-08:00",
                "timeTaken": 261
              },
              "bucket1": {
                "totalProgress": 100,
                "perNodeProgress": {
                  "ns_1@172.23.105.155": 1
                },
                "startTime": "2022-02-24T22:35:09.474-08:00",
                "completedTime": "2022-02-24T22:35:09.770-08:00",
                "timeTaken": 296
              }
            }
          }
        },
        "rebalanceId": "489f3556cab111acc4151a7019f282b5",
        "nodesInfo": {
          "active_nodes": [
            "ns_1@172.23.105.155",
            "ns_1@172.23.105.211",
            "ns_1@172.23.105.212",
            "ns_1@172.23.105.213",
            "ns_1@172.23.136.104",
            "ns_1@172.23.136.105",
            "ns_1@172.23.136.106",
            "ns_1@172.23.136.107",
            "ns_1@172.23.136.108"
          ],
          "failover_nodes": [
            "ns_1@172.23.105.155",
            "ns_1@172.23.105.213",
            "ns_1@172.23.136.106",
            "ns_1@172.23.136.107"
          ],
          "master_node": "ns_1@172.23.105.211"
        },
        "masterNode": "ns_1@172.23.105.211",
        "startTime": "2022-02-24T22:35:09.456-08:00",
        "completedTime": "2022-02-24T22:35:10.772-08:00",
        "timeTaken": 1316,
        "completionMessage": "Failover completed successfully."
      }
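      The timings in the report are internally consistent: each stage's timeTaken (in ms) matches completedTime - startTime. A small sanity-check sketch, with the top-level values copied from the report above:

      ```python
      import json
      from datetime import datetime

      # timeTaken (ms) should equal completedTime - startTime; fromisoformat
      # understands the "-08:00" UTC offset in the report's timestamps.
      stage = json.loads("""{
        "startTime": "2022-02-24T22:35:09.456-08:00",
        "completedTime": "2022-02-24T22:35:10.772-08:00",
        "timeTaken": 1316
      }""")

      def duration_ms(s):
          start = datetime.fromisoformat(s["startTime"])
          end = datetime.fromisoformat(s["completedTime"])
          return (end - start).total_seconds() * 1000.0

      # Allow a couple of ms of slack: the printed timestamps are ms-truncated.
      assert abs(duration_ms(stage) - stage["timeTaken"]) <= 2
      print(round(duration_ms(stage)))  # 1316
      ```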

       

       

      Attachments


        Activity

          Ashwin Govindarajulu added a comment:

          Assigning this to the KV team.

          Looks like a DCP backfilling issue on node 172.23.105.212.

          2022-02-24T22:36:26.901915-08:00 INFO (default) Added SeqnoPersistence request for vb:169, requested-seqno:984, high-seqno: 984, persisted-seqno:0, cookie:0x7f48b17a3b00
          2022-02-24T22:36:26.914791-08:00 INFO (default) Notified SeqnoPersistence completion for vb:170 Check for: 1216, Persisted upto: 1216, cookie 0x7f48b164cb80
          2022-02-24T22:36:26.915347-08:00 INFO Pushing new cluster config for bucket:[default] revision:{"epoch":1,"revno":7511}
          2022-02-24T22:36:26.923652-08:00 INFO (default) Notified SeqnoPersistence completion for vb:169 Check for: 984, Persisted upto: 984, cookie 0x7f48b17a3b00
          2022-02-24T22:36:26.924380-08:00 INFO (default) VBucket: created vb:168 with state:replica initialState:dead lastSeqno:0 persistedRange:{0,0} purge_seqno:0 max_cas:0 uuid:205133643477997 topology:null
          2022-02-24T22:36:26.924401-08:00 INFO (default) VBucket::setState: transitioning vb:168 with high seqno:0 from:replica to:replica
          2022-02-24T22:36:26.928187-08:00 INFO 76: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.105.211->ns_1@172.23.105.212:default - (vb:168) Attempting to add stream: opaque_:383, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:205133643477997, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
          2022-02-24T22:36:26.929486-08:00 INFO 75: [ {"ip":"127.0.0.1","port":43604} - {"ip":"127.0.0.1","port":11209} (System, <ud>@ns_server</ud>) ] Updated cluster configuration for bucket [default]. New revision: {"epoch":1,"revno":7512}
          2022-02-24T22:36:26.932111-08:00 INFO Pushing new cluster config for bucket:[default] revision:{"epoch":1,"revno":7512}
          2022-02-24T22:36:26.939789-08:00 INFO (default) Added SeqnoPersistence request for vb:168, requested-seqno:952, high-seqno: 0, persisted-seqno:0, cookie:0x7f48b164cb80
          2022-02-24T22:36:26.942679-08:00 INFO (default) VBucket: created vb:167 with state:replica initialState:dead lastSeqno:0 persistedRange:{0,0} purge_seqno:0 max_cas:0 uuid:113567003689277 topology:null
          2022-02-24T22:36:26.942716-08:00 INFO (default) VBucket::setState: transitioning vb:167 with high seqno:0 from:replica to:replica
          2022-02-24T22:36:26.951385-08:00 INFO 76: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.105.211->ns_1@172.23.105.212:default - (vb:167) Attempting to add stream: opaque_:384, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:113567003689277, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
          2022-02-24T22:36:26.965564-08:00 INFO (default) Added SeqnoPersistence request for vb:167, requested-seqno:1100, high-seqno: 0, persisted-seqno:0, cookie:0x7f48b17a3b00
          2022-02-24T22:36:26.969513-08:00 INFO (default) Notified SeqnoPersistence completion for vb:168 Check for: 952, Persisted upto: 952, cookie 0x7f48b164cb80
          2022-02-24T22:36:26.982624-08:00 INFO (default) VBucket: created vb:166 with state:replica initialState:dead lastSeqno:0 persistedRange:{0,0} purge_seqno:0 max_cas:0 uuid:245459350939447 topology:null
          2022-02-24T22:36:26.982653-08:00 INFO (default) VBucket::setState: transitioning vb:166 with high seqno:0 from:replica to:replica
          2022-02-24T22:36:26.983936-08:00 INFO 76: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.105.211->ns_1@172.23.105.212:default - (vb:166) Attempting to add stream: opaque_:385, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:245459350939447, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
          2022-02-24T22:36:26.984042-08:00 INFO 75: [ {"ip":"127.0.0.1","port":43604} - {"ip":"127.0.0.1","port":11209} (System, <ud>@ns_server</ud>) ] Updated cluster configuration for bucket [default]. New revision: {"epoch":1,"revno":7513}
          2022-02-24T22:36:26.998370-08:00 INFO (default) Added SeqnoPersistence request for vb:166, requested-seqno:1272, high-seqno: 507, persisted-seqno:0, cookie:0x7f48b164cb80
          2022-02-24T22:36:27.003235-08:00 INFO Pushing new cluster config for bucket:[default] revision:{"epoch":1,"revno":7513}
          2022-02-24T22:36:27.047551-08:00 INFO 75: [ {"ip":"127.0.0.1","port":43604} - {"ip":"127.0.0.1","port":11209} (System, <ud>@ns_server</ud>) ] Updated cluster configuration for bucket [default]. New revision: {"epoch":1,"revno":7514}
          2022-02-24T22:36:27.053284-08:00 INFO Pushing new cluster config for bucket:[default] revision:{"epoch":1,"revno":7514}
          2022-02-24T22:36:27.614081-08:00 INFO (default) DCP backfilling task temporarily suspended because the current memory usage is too high
          2022-02-24T22:36:28.607478-08:00 INFO (default) DCP backfilling task temporarily suspended because the current memory usage is too high
          2022-02-24T22:36:29.612483-08:00 INFO (default) DCP backfilling task temporarily suspended because the current memory usage is too high
          2022-02-24T22:36:30.606389-08:00 INFO (default) DCP backfilling task temporarily suspended because the current memory usage is too high
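          The repeated "backfilling suspended" lines are the symptom to grep for. A minimal sketch that counts those lines per bucket in a memcached.log excerpt (the regex is tailored to the log format shown above):

          ```python
          import re
          from collections import Counter

          # Count per-bucket "DCP backfilling suspended" lines; a count that keeps
          # growing without a matching resume suggests a stuck backfill.
          SUSPENDED = re.compile(
              r"^\S+ INFO \((?P<bucket>[^)]+)\) DCP backfilling task temporarily suspended")

          def suspended_counts(lines):
              hits = Counter()
              for line in lines:
                  m = SUSPENDED.match(line)
                  if m:
                      hits[m.group("bucket")] += 1
              return hits

          sample = [
              "2022-02-24T22:36:27.614081-08:00 INFO (default) DCP backfilling task temporarily suspended because the current memory usage is too high",
              "2022-02-24T22:36:28.607478-08:00 INFO (default) DCP backfilling task temporarily suspended because the current memory usage is too high",
          ]
          print(suspended_counts(sample))  # Counter({'default': 2})
          ```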

          Dave Rigby added a comment:

          Ashwin Govindarajulu: note that the default bucket uses value eviction:

            Bucket            Type Quota (MB) Prio Evict Indx Reps Conf Cmpct Purge (Days) Access Flush Max TTL Compress Storage   Prod   Dev
            ---------------------------------------------------------------------------------------------------------------------------------
            bucket1             CB        100  Low Value        On  Seq   30%  3 (default)   SASL   Off     Off  Passive   Couch    0/0   0/0
            bucket2            Eph        100  Low  None         -  Seq     -            1   SASL   Off     Off  Passive     Eph      -     -
            default             CB        100  Low Value        On  Seq   30%  3 (default)   SASL   Off     Off  Passive   Couch    0/0   0/0
            Total (3 buckets)    -        300   
          

          After the failover, there are 351 missing replica vBuckets:

          * vBucket Details
                                               ---Replicas---
            Bucket   Nodes  Total  Online      1      2     3
            -------------------------------------------------
            bucket1      2   1024    1024   1024    n/a   n/a
            bucket2      2   1024    1024   1024    n/a   n/a
            default      2   1024    1024    673    n/a   n/a
          [BAD]   Replica-1 vbuckets for 'default': 351 are offline
          

          And that bucket is 0% resident on node .212:

          ===Analyser results for 's73612-cnt65.sc.couchbase.com / 'ns_1@172.23.105.212''===
          * Per Node Bucket Stats (CB/Eph buckets only)
                                                            Resident Ratio         In-mem Compression %
            Bucket             CurrItems  KV (MB)  Meta%   Active  Replica  DWQ      Active     Replica
            -------------------------------------------------------------------------------------------
            bucket1                60447       21    9.1      100      100    0           -           -
            bucket2                52986       25   13.2        -        -    -           -           -
            default               276248       43   34.9        0        0    0           -           -
            Total (3 buckets)     389681       89      -        -        -    0           -           -
          

          As such, there is no free memory left for the default bucket on the remaining two KV nodes, so the 351 missing replica vBuckets cannot be built: with value eviction, each replica vBucket's metadata must be loaded into memory.

          The observed behaviour is therefore expected. After failing over, either a new node must be added (providing additional capacity), or the bucket quota on the remaining nodes must be increased.
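          To make the arithmetic explicit (all figures copied from the analyser output above; the resident-ratio interpretation is the standard one for value-eviction buckets):

          ```python
          # Figures copied from the analyser output above; nothing queries a cluster.
          TOTAL_VBUCKETS = 1024
          online_replicas = 673
          missing_replicas = TOTAL_VBUCKETS - online_replicas
          assert missing_replicas == 351

          # Resident ratio = items whose value is in memory / total items. With
          # value eviction, keys and metadata always stay resident, so a 0%
          # resident ratio means the quota is already consumed by metadata alone,
          # leaving no headroom to materialise 351 more replica vBuckets' metadata.
          curr_items, items_resident = 276248, 0
          resident_pct = 100.0 * items_resident / curr_items
          print(f"missing replicas: {missing_replicas}, resident: {resident_pct:.0f}%")
          ```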

          Ashwin Govindarajulu added a comment (edited):

          Tested with a 512 MiB bucket quota; the same test passes with the increased bucket size.

          Updated the conf file as part of https://review.couchbase.org/c/TAF/+/171284


          Ashwin Govindarajulu added a comment:

          Closing this ticket.


          People

            Assignee: Ashwin Govindarajulu
            Reporter: Ashwin Govindarajulu
            Votes: 0
            Watchers: 4
