  1. Couchbase Server
  2. MB-56889

[System test upgrade] Analytics rebalance fails with "Rebalance 7fef0ad83705736f70d24831ccdf0c6e failed: Index with resource ID 6245 already exists."

Details

    • Untriaged
    • Centos 64-bit
    • 0
    • No
    • Analytics Sprint 20

    Description

      Steps to Repro
      1. Run a longevity test on 7.1.4 for 2 days.

      ./sequoia -client 172.23.104.27:2375 -provider file:centos_pine.yml -test tests/integration/neo/test_neo.yml -scope tests/integration/neo/scope_neo_magma.yml -scale 3 -repeat 0 -log_level 0 -version 7.1.4-3601 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      2. Upgrade to 7.2.0-5324 using an online upgrade with the failover/recovery strategy.
      3. Enable CDC on all buckets and on some collections after the upgrade.

      I then failed over one node of each service (data, index, query, analytics, eventing, search)
      and started a rebalance out, which failed.

      172.23.120.58 1:12:55 AM 15 May, 2023

      Starting rebalance, KeepNodes = ['ns_1@172.23.120.75','ns_1@172.23.120.81',
      'ns_1@172.23.120.86','ns_1@172.23.121.77',
      'ns_1@172.23.123.25','ns_1@172.23.123.26',
      'ns_1@172.23.123.31','ns_1@172.23.123.33',
      'ns_1@172.23.96.243','ns_1@172.23.96.254',
      'ns_1@172.23.96.48','ns_1@172.23.97.105',
      'ns_1@172.23.97.110','ns_1@172.23.97.112',
      'ns_1@172.23.97.148','ns_1@172.23.97.241',
      'ns_1@172.23.97.74'], EjectNodes = [], Failed over and being ejected nodes = ['ns_1@172.23.120.58',
      'ns_1@172.23.120.73',
      'ns_1@172.23.120.74',
      'ns_1@172.23.120.77',
      'ns_1@172.23.123.32',
      'ns_1@172.23.96.122']; no delta recovery nodes; Operation Id = 0b87a72990070d13f578d3d9630d8b70
      

      172.23.120.58 3:40:06 AM 15 May, 2023

      Rebalance exited with reason {service_rebalance_failed,cbas,
      {worker_died,
      {'EXIT',<0.32352.1156>,
      {rebalance_failed,
      {service_error,
      <<"Rebalance 7fef0ad83705736f70d24831ccdf0c6e failed: Index with resource ID 6245 already exists.">>}}}}}.
      Rebalance Operation Id = 0b87a72990070d13f578d3d9630d8b70
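
For triage, the useful fields in the exit reason above are the failing service (`cbas`, the Analytics service), the service-level rebalance ID, and the resource ID named in the message. The following is a minimal sketch of pulling those out of the logged string with regular expressions. The function name `parse_exit_reason` and the returned keys are illustrative assumptions, not part of any Couchbase tool, and the patterns are only known to fit the one message shown in this ticket.

```python
import re

# Exit reason copied from the log above, joined onto one line.
EXIT_REASON = (
    "{service_rebalance_failed,cbas,{worker_died,{'EXIT',<0.32352.1156>,"
    "{rebalance_failed,{service_error,"
    '<<"Rebalance 7fef0ad83705736f70d24831ccdf0c6e failed: '
    'Index with resource ID 6245 already exists.">>}}}}}'
)


def _first_group(pattern, text):
    """Return the first capture group of pattern in text, or None if absent."""
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None


def parse_exit_reason(reason):
    """Extract service, message, rebalance ID and resource ID (hypothetical helper)."""
    resource_id = _first_group(r"resource ID (\d+)", reason)
    return {
        "service": _first_group(r"service_rebalance_failed,\s*(\w+)", reason),
        "message": _first_group(r'<<"(.*?)">>', reason),
        "rebalance_id": _first_group(r"Rebalance ([0-9a-f]{32}) failed", reason),
        "resource_id": int(resource_id) if resource_id is not None else None,
    }


if __name__ == "__main__":
    for key, value in parse_exit_reason(EXIT_REASON).items():
        print(f"{key}: {value}")
```

Any field whose pattern does not match is returned as `None`, so the same helper can be run over other exit reasons without raising.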
      

      I then retried the failed rebalance:

      172.23.120.75 3:45:40 AM 15 May, 2023

      Starting rebalance, KeepNodes = ['ns_1@172.23.120.75','ns_1@172.23.120.81',
      'ns_1@172.23.120.86','ns_1@172.23.121.77',
      'ns_1@172.23.123.25','ns_1@172.23.123.26',
      'ns_1@172.23.123.31','ns_1@172.23.123.33',
      'ns_1@172.23.96.243','ns_1@172.23.96.254',
      'ns_1@172.23.96.48','ns_1@172.23.97.105',
      'ns_1@172.23.97.110','ns_1@172.23.97.112',
      'ns_1@172.23.97.148','ns_1@172.23.97.241',
      'ns_1@172.23.97.74'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 3c4e20642bb3df706e7b2a68bafe3faf
      

      I noticed the rebalance progress keeps increasing beyond 100%.

      balakumaran.g@Balakumarans-MacBook-Pro-2 sequoia %  curl -u Administrator:password http://172.23.97.74:8091/pools/default/rebalanceProgress | jq
       
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100   690  100   690    0     0   1296      0 --:--:-- --:--:-- --:--:--  1294
      {
        "status": "running",
        "ns_1@172.23.97.110": {
          "progress": 0
        },
        "ns_1@172.23.120.81": {
          "progress": 1
        },
        "ns_1@172.23.97.241": {
          "progress": 1
        },
        "ns_1@172.23.123.31": {
          "progress": 1
        },
        "ns_1@172.23.97.112": {
          "progress": 1
        },
        "ns_1@172.23.96.243": {
          "progress": 1
        },
        "ns_1@172.23.123.33": {
          "progress": 1
        },
        "ns_1@172.23.97.74": {
          "progress": 1
        },
        "ns_1@172.23.96.254": {
          "progress": 0
        },
        "ns_1@172.23.120.75": {
          "progress": 0
        },
        "ns_1@172.23.123.25": {
          "progress": 0
        },
        "ns_1@172.23.97.105": {
          "progress": 1
        },
        "ns_1@172.23.120.86": {
          "progress": 2.569999999999968e-13
        },
        "ns_1@172.23.123.26": {
          "progress": 1
        },
        "ns_1@172.23.121.77": {
          "progress": 2.569999999999968e-13
        },
        "ns_1@172.23.96.48": {
          "progress": 2.569999999999968e-13
        },
        "ns_1@172.23.97.148": {
          "progress": 0
        }
      }
      balakumaran.g@Balakumarans-MacBook-Pro-2 sequoia % 
      balakumaran.g@Balakumarans-MacBook-Pro-2 sequoia %  curl -u Administrator:password http://172.23.97.74:8091/pools/default/rebalanceProgress | jq
       
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100   690  100   690    0     0    864      0 --:--:-- --:--:-- --:--:--   863
      {
        "status": "running",
        "ns_1@172.23.97.110": {
          "progress": 0
        },
        "ns_1@172.23.120.81": {
          "progress": 1
        },
        "ns_1@172.23.97.241": {
          "progress": 1
        },
        "ns_1@172.23.123.31": {
          "progress": 1
        },
        "ns_1@172.23.97.112": {
          "progress": 1
        },
        "ns_1@172.23.96.243": {
          "progress": 1
        },
        "ns_1@172.23.123.33": {
          "progress": 1
        },
        "ns_1@172.23.97.74": {
          "progress": 1
        },
        "ns_1@172.23.96.254": {
          "progress": 0
        },
        "ns_1@172.23.120.75": {
          "progress": 0
        },
        "ns_1@172.23.123.25": {
          "progress": 0
        },
        "ns_1@172.23.97.105": {
          "progress": 1
        },
        "ns_1@172.23.120.86": {
          "progress": 4.123000000000275e-13
        },
        "ns_1@172.23.123.26": {
          "progress": 1
        },
        "ns_1@172.23.121.77": {
          "progress": 4.123000000000275e-13
        },
        "ns_1@172.23.96.48": {
          "progress": 4.123000000000275e-13
        },
        "ns_1@172.23.97.148": {
          "progress": 0
        }
      }
      balakumaran.g@Balakumarans-MacBook-Pro-2 sequoia % 
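
Between the two calls above, only three nodes report a different `progress` value; the rest are unchanged. The following is a rough sketch of comparing two `rebalanceProgress` responses programmatically rather than by eye. The helper names `node_progress` and `changed_nodes` are hypothetical, and the sample data is a five-node subset of the seventeen nodes in the output above, with values copied verbatim.

```python
def node_progress(snapshot):
    """Map node name -> progress, skipping non-node keys such as "status"."""
    return {
        node: info["progress"]
        for node, info in snapshot.items()
        if isinstance(info, dict) and "progress" in info
    }


def changed_nodes(before, after):
    """Return {node: (old, new)} for nodes whose progress differs between snapshots."""
    old, new = node_progress(before), node_progress(after)
    return {
        node: (old[node], new[node])
        for node in sorted(old.keys() & new.keys())
        if old[node] != new[node]
    }


# Subset of the first response above (5 of 17 nodes).
first = {
    "status": "running",
    "ns_1@172.23.97.110": {"progress": 0},
    "ns_1@172.23.120.81": {"progress": 1},
    "ns_1@172.23.120.86": {"progress": 2.569999999999968e-13},
    "ns_1@172.23.121.77": {"progress": 2.569999999999968e-13},
    "ns_1@172.23.96.48": {"progress": 2.569999999999968e-13},
}

# The same nodes from the second response.
second = {
    "status": "running",
    "ns_1@172.23.97.110": {"progress": 0},
    "ns_1@172.23.120.81": {"progress": 1},
    "ns_1@172.23.120.86": {"progress": 4.123000000000275e-13},
    "ns_1@172.23.121.77": {"progress": 4.123000000000275e-13},
    "ns_1@172.23.96.48": {"progress": 4.123000000000275e-13},
}

if __name__ == "__main__":
    for node, (old, new) in changed_nodes(first, second).items():
        print(f"{node}: {old} -> {new}")
```

In practice the two dictionaries would come from `json.loads` on successive responses from `/pools/default/rebalanceProgress`, so the comparison can be repeated at intervals while the rebalance is running.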
      

      My guess is that the earlier rebalance failure is causing this unusual behaviour. Would this result in a hang? If so, could we get a workaround?

      cbcollect_info attached.

            People

              Balakumaran.Gopal Balakumaran Gopal
              Balakumaran.Gopal Balakumaran Gopal