Couchbase Server / MB-62068

[100M, 1536]: FTS rebalance has been hung for 3 days.

Details

    Description

      Steps

      1. Cluster with 3 KV and 6 FTS nodes, each 16C, 64G. FTS RAM quota is set to 50G.
      2. Load 50M base64-encoded 1536-dim vectors into KV.
      3. Build an FTS index on them.
      4. Run 1000 upserts per second on KV.
      5. Start 2 FTS query threads.
      6. Scale up to 7 nodes while loading docs.
      7. Scale up to 8 nodes while loading docs.
      8. Scale down to 7 nodes while loading docs.
      9. Scale down to 6 nodes while loading docs.
      10. Scale disk while loading docs. This triggers a swap rebalance for all 6 nodes, one at a time.
      11. Scale disk while loading docs. This triggers a swap rebalance for all 6 nodes, one at a time.
      12. Scale compute to 32C, 128G while loading docs. This triggers a swap rebalance for all 6 nodes, one at a time.
      13. Scale compute back to 16C, 32G while loading docs. This triggers a swap rebalance for all 6 nodes, one node at a time. One of the swap rebalances failed because the fts service was killed (exit status 137). To bring the cluster back to a healthy state, CP added the node that was being ejected back to the cluster and triggered a rebalance-in. That rebalance is hung.
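For a rough sense of scale, the data volume in step 2 can be compared with the FTS RAM quota in step 1. The sketch below is only back-of-the-envelope arithmetic on the raw vector payload. It assumes each dimension is a 4-byte float32, which the ticket does not state, and it says nothing about how much memory the FTS index itself needs.

```python
import base64

# Figures taken from the steps above.
NUM_VECTORS = 50_000_000   # step 2
DIMS = 1536                # step 2
FTS_NODES = 6              # step 1
FTS_QUOTA_G = 50           # step 1, per node

# ASSUMPTION (not stated in the ticket): each dimension is a 4-byte float32.
BYTES_PER_DIM = 4


def raw_vector_bytes(dims: int, bytes_per_dim: int) -> int:
    """Size of one vector before base64 encoding."""
    return dims * bytes_per_dim


def base64_len(raw_len: int) -> int:
    """Length of the padded base64 encoding of raw_len bytes."""
    return 4 * ((raw_len + 2) // 3)


raw = raw_vector_bytes(DIMS, BYTES_PER_DIM)
encoded = base64_len(raw)
# Cross-check the formula against the standard library encoder.
assert encoded == len(base64.b64encode(bytes(raw)))

GB = 10**9
print(f"per vector: {raw} raw bytes, {encoded} base64 chars")
print(f"all vectors, raw: {NUM_VECTORS * raw / GB:.1f} GB")
print(f"all vectors, base64: {NUM_VECTORS * encoded / GB:.1f} GB")
print(f"aggregate FTS quota: {FTS_NODES * FTS_QUOTA_G} G")
```

Under the float32 assumption, the raw payload alone is of the same order as the aggregate FTS quota, which may be useful context when reading the OOM kill reported below.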

      Swap Rebalance

      Starting rebalance, KeepNodes = ['ns_1@svc-d-node-001.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-d-node-002.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-d-node-003.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-s-node-053.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-s-node-054.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-s-node-055.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-s-node-056.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-s-node-057.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-s-node-058.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com'], EjectNodes = ['ns_1@svc-s-node-050.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = f6a458ab8b6d862f8607a8a97d4bb17b
      

      Rebalance failed due to OOM on node 058

      Service 'fts' exited with status 137. Restarting. Messages:
      2024-05-25T09:49:12.931+00:00 [INFO] feed_dcp_gocbcore: newGocbcoreDCPFeed, name: default0_VXXJEVolumeCollection3_fts_idx_1_36cb12525ef39e55_75e1d5b4, indexName: default0_VXXJEVolumeCollection3_fts_idx_1, server: http://127.0.0.1:8091, bucketName: default0_VXXJE, bucketUUID: 0b2837845dbddcd5a6693ecd21a530de
      2024-05-25T09:49:12.931+00:00 [INFO] feed_dcp_gocbcore: Start, name: default0_VXXJEVolumeCollection3_fts_idx_1_36cb12525ef39e55_75e1d5b4, num streams: 28, manifestUID: 6, streamOptions: {FilterOptions: &{ScopeID:0 CollectionIDs:[13]}, StreamOptions: &{StreamID:270}}, vbuckets: [492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519]
      2024-05-25T09:49:12.932+00:00 [INFO] janitor: awakes, op: kick, msg: feed init kick for pindex: default0_VXXJEVolumeCollection3_fts_idx_1_36cb12525ef39e55_75e1d5b4
      2024-05-25T09:49:13.059+00:00 [INFO] janitor: pindexes to remove: 0
      2024-05-25T09:49:13.059+00:00 [INFO] janitor: pindexes to add: 0
      2024-05-25T09:49:13.059+00:00 [INFO] janitor: pindexes to restart: 0
      2024-05-25T09:49:13.059+00:00 [INFO] janitor: pindexes to hibernate: 0
      2024-05-25T09:49:13.061+00:00 [INFO] janitor: feeds to remove: 0
      2024-05-25T09:49:13.061+00:00 [INFO] janitor: feeds to add: 0
       
      Rebalance exited with reason {service_rebalance_failed,fts,
      {agent_died,<37081.3452.0>,
      {lost_connection,
      {'ns_1@svc-s-node-058.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      shutdown}}}}.
      Rebalance Operation Id = f6a458ab8b6d862f8607a8a97d4bb17b
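The exit status 137 in the message above can be decoded with the usual shell convention that a status above 128 means "terminated by signal (status - 128)". The helper below is a minimal sketch of that convention; `describe_exit_status` is a name made up for illustration, and it is an assumption that the status logged for the fts service follows this convention. A SIGKILL is consistent with the kernel OOM killer but does not prove it on its own.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (128 + N means killed by signal N)."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number is not defined on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with code {status}"


# On Linux, 137 - 128 = 9, which is SIGKILL.
print(describe_exit_status(137))
```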
      

      CP triggered rebalance-in for node 050

      Starting rebalance, KeepNodes = ['ns_1@svc-d-node-001.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-d-node-002.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-d-node-003.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-s-node-050.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-s-node-053.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-s-node-054.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-s-node-055.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-s-node-056.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-s-node-057.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com',
      'ns_1@svc-s-node-058.fwyc9tdnlqdmx5jy.sandbox.nonprod-project-avengers.com'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 3348564277ee8878d4bd56fd39a500bf
      

      This one is hung!
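To make the difference between the two "Starting rebalance" messages explicit, the node lists can be parsed and compared. This is a small sketch that relies only on the message layout shown in this ticket; `parse_rebalance_line` is a hypothetical helper, and the hostnames in the example are shortened placeholders, not the real ones.

```python
import re

# Node names in the messages above look like 'ns_1@<hostname>'.
NODE_RE = re.compile(r"'(ns_1@[^']+)'")


def parse_rebalance_line(text: str) -> tuple[set[str], set[str]]:
    """Return (keep_nodes, eject_nodes) from a 'Starting rebalance' message."""
    keep_part, _, rest = text.partition("EjectNodes =")
    # Stop before the 'Failed over and being ejected nodes' list.
    eject_part = rest.partition("Failed over")[0]
    return set(NODE_RE.findall(keep_part)), set(NODE_RE.findall(eject_part))


# Shortened placeholder hostnames, for illustration only.
swap = ("Starting rebalance, KeepNodes = ['ns_1@node-053.example', "
        "'ns_1@node-058.example'], EjectNodes = ['ns_1@node-050.example'], "
        "Failed over and being ejected nodes = []; no delta recovery nodes")
rebalance_in = ("Starting rebalance, KeepNodes = ['ns_1@node-050.example', "
                "'ns_1@node-053.example', 'ns_1@node-058.example'], "
                "EjectNodes = [], Failed over and being ejected nodes = []")

swap_keep, swap_eject = parse_rebalance_line(swap)
in_keep, in_eject = parse_rebalance_line(rebalance_in)

print("ejected by swap:", sorted(swap_eject))
print("added back by rebalance-in:", sorted(in_keep - swap_keep))
```

Applied to the two full messages in this ticket, the same comparison shows node 050 in EjectNodes for the swap rebalance and in KeepNodes for the hung rebalance-in, with node 058 kept in both.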

          People

            ritesh.agarwal Ritesh Agarwal