Details

    • Technical task
    • Resolution: Fixed
    • Critical
    • 7.6.2
    • 7.6.0
    • fts
    • None
    • 0

    Description

      The computation of the nodes to add/remove lists during rebalance is incorrect.

      A rebalance-in (adding a node) is incorrectly represented as a swap rebalance (one node removed and another added).
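      To make the symptom concrete, here is a minimal Go sketch that labels a topology change from its two lists. The helper `classifyChange` and its labels are hypothetical, for illustration only, and are not taken from cbgt. It assumes a swap means an equal, non-zero number of nodes added and removed.

```go
package main

import "fmt"

// classifyChange labels a topology change from its add/remove lists.
// Hypothetical helper for illustration; not part of cbgt.
// Assumption: a swap adds and removes the same, non-zero number of nodes.
func classifyChange(nodesToAdd, nodesToRemove []string) string {
	add, remove := len(nodesToAdd), len(nodesToRemove)
	switch {
	case add > 0 && add == remove:
		return "swap"
	case add > 0 && remove > 0:
		return "mixed add/remove"
	case add > 0:
		return "add only"
	case remove > 0:
		return "remove only"
	default:
		return "no node change"
	}
}

func main() {
	// Adding one node back should produce only a nodesToAdd entry.
	fmt.Println(classifyChange([]string{"08fb49fbde9dcc8981eac976612f0101"}, nil))

	// The lists logged during the faulty rebalance also name a node to
	// remove, so the same operation is classified as a swap.
	fmt.Println(classifyChange(
		[]string{"08fb49fbde9dcc8981eac976612f0101"},
		[]string{"23cd02d4721d24f810aeac9030267496"},
	))
}
```

      A single spurious entry in nodesToRemove is enough to turn an add-only change into a swap, which is the misclassification this issue reports.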

      Steps to Reproduce:
      1. Create a 4-node cluster with 2 indexes, each with 4 partitions and 1 replica.
      2. Remove 1 node and rebalance it out.
      3. Check the nodes to add/remove lists.
      4. Add the node back and rebalance it in.
      5. Check the nodes to add/remove lists.

      Logs from local reproduction:

      1. Remove node 23cd02d4721d24f810aeac9030267496 from the cluster:

      2024-03-15T17:00:36.514+05:30 [INFO] ctl/manager: GetNodeInfo
      2024-03-15T17:00:36.516+05:30 [INFO] ctl/manager: PrepareTopologyChange, change: {f0620f8b613b0f301902565e4339588d [] topology-change-rebalance [{{1801b721ef6f964114dd0e335cdd4de2 0 <nil>} recovery-full} {{b5c59089a94b2af20b0c1371e1eb205b 0 <nil>} recovery-full} {{d2d05d6b078af00a19688f8480ceb485 0 <nil>} recovery-full}] [{23cd02d4721d24f810aeac9030267496 0 <nil>}]}
      2024-03-15T17:00:36.522+05:30 [INFO] ctl/manager: PrepareTopologyChange, done
      2024-03-15T17:00:36.522+05:30 [INFO] ctl/manager: GetTaskList, haveTasksRev: 376, changed, rv: &{Rev:[51 55 56] Tasks:[{Rev:[51 55 55] ID:prepare:f0620f8b613b0f301902565e4339588d Type:task-prepared Status:task-running IsCancelable:true Progress:100 DetailedProgress:map[] Description:prepare topology change ErrorMessage: Extra:map[topologyChange:{ID:f0620f8b613b0f301902565e4339588d CurrentTopologyRev:[] Type:topology-change-rebalance KeepNodes:[{NodeInfo:{NodeID:1801b721ef6f964114dd0e335cdd4de2 Priority:0 Opaque:<nil>} RecoveryType:recovery-full} {NodeInfo:{NodeID:b5c59089a94b2af20b0c1371e1eb205b Priority:0 Opaque:<nil>} RecoveryType:recovery-full} {NodeInfo:{NodeID:d2d05d6b078af00a19688f8480ceb485 Priority:0 Opaque:<nil>} RecoveryType:recovery-full}] EjectNodes:[{NodeID:23cd02d4721d24f810aeac9030267496 Priority:0 Opaque:<nil>}]}]}]}
      ctl prev nodes now [1801b721ef6f964114dd0e335cdd4de2 23cd02d4721d24f810aeac9030267496 b5c59089a94b2af20b0c1371e1eb205b d2d05d6b078af00a19688f8480ceb485] 
      ctl member nodes now [1801b721ef6f964114dd0e335cdd4de2 23cd02d4721d24f810aeac9030267496 b5c59089a94b2af20b0c1371e1eb205b d2d05d6b078af00a19688f8480ceb485]

        The node lists on each node do not change during the Prepare phase: both still include the node being removed.

      2. The node defs are updated shortly before the rebalance ends:

      2024-03-15T17:00:37.112+05:30 [INFO] cfg_metakv: metaKVCallback, path: /fts/cbgt/cfg/nodeDefs-known/23cd02d4721d24f810aeac9030267496, key: nodeDefs-known/23cd02d4721d24f810aeac9030267496, deletion: true
      2024-03-15T17:00:37.121+05:30 [INFO] ctl: run, kind: nodeDefs-wanted, updated memberNodes: {1801b721ef6f964114dd0e335cdd4de2;b5c59089a94b2af20b0c1371e1eb205b;d2d05d6b078af00a19688f8480ceb485;} 

      ....

      2024-03-15T17:00:37.128+05:30 [INFO] ctl/manager: GetTaskList, haveTasksRev: 564, changed, rv: &{Rev:[53 54 53] Tasks:[]}
      ctl prev nodes now [1801b721ef6f964114dd0e335cdd4de2 23cd02d4721d24f810aeac9030267496 b5c59089a94b2af20b0c1371e1eb205b d2d05d6b078af00a19688f8480ceb485]
      ctl member nodes now [08fb49fbde9dcc8981eac976612f0101 1801b721ef6f964114dd0e335cdd4de2 b5c59089a94b2af20b0c1371e1eb205b d2d05d6b078af00a19688f8480ceb485]

      3. Now, perform a rebalance to add back the node.

      The cfg keys for the node defs are updated before the Prepare phase:

      2024-03-15T17:07:48.988+05:30 [INFO] cfg_metakv: metaKVCallback, path: /fts/cbgt/cfg/nodeDefs-known/08fb49fbde9dcc8981eac976612f0101, key: nodeDefs-known/08fb49fbde9dcc8981eac976612f0101, deletion: false
      2024-03-15T17:07:49.018+05:30 [INFO] cfg_metakv: metaKVCallback, path: /fts/cbgt/cfg/nodeDefs-wanted/08fb49fbde9dcc8981eac976612f0101, key: nodeDefs-wanted/08fb49fbde9dcc8981eac976612f0101, deletion: false

      ctl prev nodes now [1801b721ef6f964114dd0e335cdd4de2 23cd02d4721d24f810aeac9030267496 b5c59089a94b2af20b0c1371e1eb205b d2d05d6b078af00a19688f8480ceb485]
      ctl member nodes now [08fb49fbde9dcc8981eac976612f0101 1801b721ef6f964114dd0e335cdd4de2 b5c59089a94b2af20b0c1371e1eb205b d2d05d6b078af00a19688f8480ceb485]

      (This is in contrast to the previous rebalance, where the node defs were updated only shortly before the rebalance ended. I'm mentioning this because the cfg node defs are later used to determine the nodes to add/remove lists.)

      4. The nodes to add/remove lists are computed incorrectly:

      2024-03-15T17:07:49.133+05:30 [INFO] rebalance: nodesAll: []string{"08fb49fbde9dcc8981eac976612f0101", "1801b721ef6f964114dd0e335cdd4de2", "23cd02d4721d24f810aeac9030267496", "b5c59089a94b2af20b0c1371e1eb205b", "d2d05d6b078af00a19688f8480ceb485"}
      2024-03-15T17:07:49.133+05:30 [INFO] rebalance: nodesToAdd: []string{"08fb49fbde9dcc8981eac976612f0101"}
      2024-03-15T17:07:49.133+05:30 [INFO] rebalance: nodesToRemove: []string{"23cd02d4721d24f810aeac9030267496"} 
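      The three log lines are consistent with a plain set computation over the "ctl prev nodes" and "ctl member nodes" lists logged earlier: nodesAll as their sorted union, nodesToAdd as member minus prev, and nodesToRemove as prev minus member. The Go sketch below assumes that model; `computeNodeLists`, `union` and `difference` are hypothetical helpers, not cbft or cbgt functions. Feeding in the stale prev list reproduces the logged output, while a prev list that reflects the completed removal (an assumed expected state) leaves nodesToRemove empty.

```go
package main

import (
	"fmt"
	"sort"
)

// difference returns the sorted elements of a that are not in b.
func difference(a, b []string) []string {
	inB := make(map[string]bool, len(b))
	for _, n := range b {
		inB[n] = true
	}
	var out []string
	for _, n := range a {
		if !inB[n] {
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

// union returns the sorted, de-duplicated elements of a and b.
func union(a, b []string) []string {
	seen := make(map[string]bool, len(a)+len(b))
	var out []string
	for _, list := range [][]string{a, b} {
		for _, n := range list {
			if !seen[n] {
				seen[n] = true
				out = append(out, n)
			}
		}
	}
	sort.Strings(out)
	return out
}

// computeNodeLists is a hypothetical model of how the rebalance lists are
// derived from the previous and current member node lists.
func computeNodeLists(prev, member []string) (all, toAdd, toRemove []string) {
	return union(prev, member), difference(member, prev), difference(prev, member)
}

func main() {
	// "ctl member nodes now" from the log.
	member := []string{
		"08fb49fbde9dcc8981eac976612f0101", "1801b721ef6f964114dd0e335cdd4de2",
		"b5c59089a94b2af20b0c1371e1eb205b", "d2d05d6b078af00a19688f8480ceb485",
	}
	// "ctl prev nodes now" from the log: still contains the removed node.
	stalePrev := []string{
		"1801b721ef6f964114dd0e335cdd4de2", "23cd02d4721d24f810aeac9030267496",
		"b5c59089a94b2af20b0c1371e1eb205b", "d2d05d6b078af00a19688f8480ceb485",
	}
	all, toAdd, toRemove := computeNodeLists(stalePrev, member)
	fmt.Println(len(all), toAdd, toRemove)

	// Hypothetical prev list once the first rebalance has completed (3 nodes).
	freshPrev := []string{
		"1801b721ef6f964114dd0e335cdd4de2",
		"b5c59089a94b2af20b0c1371e1eb205b", "d2d05d6b078af00a19688f8480ceb485",
	}
	all, toAdd, toRemove = computeNodeLists(freshPrev, member)
	fmt.Println(len(all), toAdd, toRemove)
}
```

      Under this model the stale prev list yields 5 nodes in nodesAll, 08fb49fbde9dcc8981eac976612f0101 to add and 23cd02d4721d24f810aeac9030267496 to remove, in the same sorted order as the log; the refreshed list yields 4 nodes and an empty remove list.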

      The output of /api/cfg at this point shows only 4 nodes, as expected. However, nodesAll lists 5 nodes because the prevNodes list is computed incorrectly: it still contains the removed node 23cd02d4721d24f810aeac9030267496.

      "nodeDefsKnown": {
          "uuid": "1976040607",
          "nodeDefs": {
            "08fb49fbde9dcc8981eac976612f0101": {
              "hostPort": "127.0.0.1:9208",
              "uuid": "08fb49fbde9dcc8981eac976612f0101",
              "implVersion": "5.7.0",
              "tags": [
                "feed",
                "janitor",
                "pindex",
                "queryer",
                "cbauth_service"
              ],
              "container": "datacenter/Group 1",
              "weight": 1,
              "extras": "{\"bindGRPC\":\"127.0.0.1:9209\",\"bindGRPCSSL\":\"127.0.0.1:19209\",\"bindHTTPS\":\":19208\",\"features\":\"leanPlan,advMetaEncoding,indexType:scorch,protocol:gRPC,gocbcore:collections,segmentVersion:15,fileTransferRebalance,geoSpatial,vectors\",\"nsHostPort\":\"127.0.0.1:9004\",\"version-cbft.app\":\"v0.6.0\",\"version-cbft.lib\":\"v0.5.5\"}"
            },
            "1801b721ef6f964114dd0e335cdd4de2": {
              "hostPort": "172.16.1.229:9200",
              "uuid": "1801b721ef6f964114dd0e335cdd4de2",
              "implVersion": "5.7.0",
              "tags": [
                "feed",
                "janitor",
                "pindex",
                "queryer",
                "cbauth_service"
              ],
              "container": "datacenter/Group 1",
              "weight": 1,
              "extras": "{\"bindGRPC\":\"172.16.1.229:9201\",\"bindGRPCSSL\":\"172.16.1.229:19201\",\"bindHTTPS\":\":19200\",\"features\":\"leanPlan,advMetaEncoding,indexType:scorch,protocol:gRPC,gocbcore:collections,segmentVersion:15,fileTransferRebalance,geoSpatial,vectors\",\"nsHostPort\":\"172.16.1.229:9000\",\"version-cbft.app\":\"v0.6.0\",\"version-cbft.lib\":\"v0.5.5\"}"
            },
            "b5c59089a94b2af20b0c1371e1eb205b": {
              "hostPort": "127.0.0.1:9202",
              "uuid": "b5c59089a94b2af20b0c1371e1eb205b",
              "implVersion": "5.7.0",
              "tags": [
                "feed",
                "janitor",
                "pindex",
                "queryer",
                "cbauth_service"
              ],
              "container": "datacenter/Group 1",
              "weight": 1,
              "extras": "{\"bindGRPC\":\"127.0.0.1:9203\",\"bindGRPCSSL\":\"127.0.0.1:19203\",\"bindHTTPS\":\":19202\",\"features\":\"leanPlan,advMetaEncoding,indexType:scorch,protocol:gRPC,gocbcore:collections,segmentVersion:15,fileTransferRebalance,geoSpatial,vectors\",\"nsHostPort\":\"127.0.0.1:9001\",\"version-cbft.app\":\"v0.6.0\",\"version-cbft.lib\":\"v0.5.5\"}"
            },
            "d2d05d6b078af00a19688f8480ceb485": {
              "hostPort": "127.0.0.1:9206",
              "uuid": "d2d05d6b078af00a19688f8480ceb485",
              "implVersion": "5.7.0",
              "tags": [
                "feed",
                "janitor",
                "pindex",
                "queryer",
                "cbauth_service"
              ],
              "container": "datacenter/Group 1",
              "weight": 1,
              "extras": "{\"bindGRPC\":\"127.0.0.1:9207\",\"bindGRPCSSL\":\"127.0.0.1:19207\",\"bindHTTPS\":\":19206\",\"features\":\"leanPlan,advMetaEncoding,indexType:scorch,protocol:gRPC,gocbcore:collections,segmentVersion:15,fileTransferRebalance,geoSpatial,vectors\",\"nsHostPort\":\"127.0.0.1:9003\",\"version-cbft.app\":\"v0.6.0\",\"version-cbft.lib\":\"v0.5.5\"}"
            }
          },
          "implVersion": "5.7.0"
        }, 

          People

            aditi.ahuja Aditi Ahuja