Couchbase Server / MB-41997

Killing babysitter on orchestrator + hard failover + delta recovery + rebalance fails with prepare_rebalance_failed

Details

    • Bug
    • Resolution: Incomplete
    • Major
    • None
    • 6.6.1
    • qe
    • None
    • Enterprise Edition 6.6.1 build 9123 ‧ IPv4 © 2020 Couchbase, Inc.
    • Untriaged
    • Centos 64-bit
    • 1
    • No

    Description

      Script to Repro

      ./testrunner -i /tmp/win10-bucket-ops.ini rerun=False -t volumetests.test_system_orchestrator_heartbeats_and_timeouts.volume.test_volume_MB_41562,nodes_init=6,initial_load=500000,replicas=2
      

      Steps to Repro
      1. Set non-default orchestrator heartbeats and timeouts.

      curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({mb_master, heartbeat_interval}, 500).'
      curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({mb_master, timeout_interval_count}, 3).'
      curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({leader_lease_acquire_worker, lease_time}, 5000).'
      curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({leader_lease_acquire_worker, lease_grace_time}, 2000).'
      curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({leader_lease_acquire_worker, lease_renew_after}, 500).'
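
The five settings in step 1 differ only in the config key and value, so the commands can be generated from a table. A minimal Python sketch, assuming the same host, credentials, and keys as the commands above; `SETTINGS`, `erlang_set_expr`, and `curl_command` are illustrative names and are not part of testrunner or Couchbase:

```python
import shlex

# (config key prefix, parameter, value) triples taken from step 1.
SETTINGS = [
    ("mb_master", "heartbeat_interval", 500),
    ("mb_master", "timeout_interval_count", 3),
    ("leader_lease_acquire_worker", "lease_time", 5000),
    ("leader_lease_acquire_worker", "lease_grace_time", 2000),
    ("leader_lease_acquire_worker", "lease_renew_after", 500),
]


def erlang_set_expr(component, param, value):
    """Build the Erlang expression sent as the /diag/eval request body."""
    return "ns_config:set({%s, %s}, %d)." % (component, param, value)


def curl_command(expr, host="localhost:9000",
                 user="Administrator", password="asdasd"):
    """Render a shell-safe curl command line for one expression."""
    args = ["curl", "http://%s/diag/eval" % host,
            "-u", "%s:%s" % (user, password), "-d", expr]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    for component, param, value in SETTINGS:
        print(curl_command(erlang_set_expr(component, param, value)))
```

Quoting every argument with `shlex.quote` means each Erlang expression is wrapped in plain ASCII single quotes.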
      

      2. Create a 6 node cluster and 4 buckets.
      +----------------+----------+--------------+
      | Nodes          | Services | Status       |
      +----------------+----------+--------------+
      | 172.23.105.175 | kv       | Cluster node |
      | 172.23.106.250 | index    | Cluster node |
      | 172.23.106.236 | kv       | Cluster node |
      | 172.23.106.251 | n1ql     | Cluster node |
      | 172.23.106.233 | kv       | Cluster node |
      | 172.23.106.238 | kv       | Cluster node |
      +----------------+----------+--------------+

      3. Create data on 4 buckets.
      2020-10-12 08:57:09,473 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
      +---------+---------+----------+------------+-----+--------+-------------+-----------+-----------+
      | Bucket  | Type    | Replicas | Durability | TTL | Items  | RAM Quota   | RAM Used  | Disk Used |
      +---------+---------+----------+------------+-----+--------+-------------+-----------+-----------+
      | bucket1 | membase | 2        | none       | 0   | 500000 | 19960692736 | 393151040 | 314841262 |
      | bucket2 | membase | 2        | none       | 0   | 500000 | 19960692736 | 386728464 | 315357753 |
      | bucket3 | membase | 2        | none       | 0   | 500000 | 19960692736 | 401526144 | 278419032 |
      | bucket4 | membase | 2        | none       | 0   | 500000 | 19960692736 | 427735520 | 777525742 |
      +---------+---------+----------+------------+-----+--------+-------------+-----------+-----------+

      4. Start the data load and rebalance in node 172.23.120.87.
      2020-10-12 08:57:12,467 | test | INFO | pool-1-thread-3 | [table_view:display:72] Rebalance Overview
      +----------------+----------+--------------+
      | Nodes          | Services | Status       |
      +----------------+----------+--------------+
      | 172.23.105.175 | kv       | Cluster node |
      | 172.23.106.250 | index    | Cluster node |
      | 172.23.106.236 | kv       | Cluster node |
      | 172.23.106.251 | n1ql     | Cluster node |
      | 172.23.106.233 | kv       | Cluster node |
      | 172.23.106.238 | kv       | Cluster node |
      | 172.23.120.87  | None     | <--- IN ---  |
      +----------------+----------+--------------+

      5. Find the orchestrator node (172.23.105.175), kill the babysitter on it, do a hard failover, start couchbase-server, then do a delta recovery and rebalance.

      2020-10-12 09:01:16,887 | test  | INFO    | MainThread | [test_system_orchestrator_heartbeats_and_timeouts:kill_babysittter_process:252] Killing babysitter on : 172.23.105.175
      

      2020-10-12 09:01:18,410 | test  | INFO    | MainThread | [test_system_orchestrator_heartbeats_and_timeouts:kill_babysitter_orchestrator_hard_failover_recovery_and_rebalance:523] Number of failed over nodes : 1
      

      2020-10-12 09:01:18,950 | test  | INFO    | MainThread | [test_system_orchestrator_heartbeats_and_timeouts:start_couchbase_server:259] Starting couchbase server on : 172.23.105.175
      

      2020-10-12 09:13:19,786 | test  | INFO    | MainThread | [test_system_orchestrator_heartbeats_and_timeouts:kill_babysitter_orchestrator_hard_failover_recovery_and_rebalance:537] Doing a delta recovery
      

      2020-10-12 09:13:50,394 | test  | INFO    | pool-1-thread-2 | [table_view:display:72] Rebalance Overview
      +----------------+----------+--------------+
      | Nodes          | Services | Status       |
      +----------------+----------+--------------+
      | 172.23.105.175 | kv       | Cluster node |
      | 172.23.106.250 | index    | Cluster node |
      | 172.23.106.236 | kv       | Cluster node |
      | 172.23.106.251 | n1ql     | Cluster node |
      | 172.23.106.233 | kv       | Cluster node |
      | 172.23.106.238 | kv       | Cluster node |
      | 172.23.120.87  | kv       | Cluster node |
      +----------------+----------+--------------+
       
      2020-10-12 09:13:55,414 | test  | INFO    | pool-1-thread-2 | [task:check:304] Rebalance - status: none, progress: 100
      2020-10-12 09:13:55,446 | test  | INFO    | pool-1-thread-2 | [task:check:363] Rebalance completed with progress: 100% in 5.05200004578 sec
      2020-10-12 09:18:00,467 | test  | INFO    | MainThread | [rest_client:rebalance_reached:105] Rebalance reached >100% in 245.019999981 seconds 
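
The timing between the sub-steps of step 5 may matter when comparing against the ns_server logs in the cbcollect_info. A small Python sketch that computes the gaps from the testrunner timestamps in the log lines above; the labels, `EVENTS`, `parse_ts`, and `gaps` are illustrative names chosen here:

```python
from datetime import datetime

# (label, timestamp) pairs copied from the testrunner log lines above.
EVENTS = [
    ("kill babysitter", "2020-10-12 09:01:16,887"),
    ("failover count check", "2020-10-12 09:01:18,410"),
    ("start couchbase-server", "2020-10-12 09:01:18,950"),
    ("delta recovery", "2020-10-12 09:13:19,786"),
    ("rebalance overview", "2020-10-12 09:13:50,394"),
]


def parse_ts(text):
    """Parse a testrunner timestamp such as '2020-10-12 09:01:16,887'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def gaps(events):
    """Return (from_label, to_label, seconds) for consecutive events."""
    out = []
    for (prev_label, prev_ts), (label, ts) in zip(events, events[1:]):
        delta = parse_ts(ts) - parse_ts(prev_ts)
        out.append((prev_label, label, delta.total_seconds()))
    return out


if __name__ == "__main__":
    for start, end, seconds in gaps(EVENTS):
        print("%-24s -> %-24s %9.3f s" % (start, end, seconds))
```

For this run, the largest gap is between starting couchbase-server and the delta recovery: 720.836 seconds, about 12 minutes.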
      

      6. Repeat step 5; this time the orchestrator is 172.23.106.233.
      On this second pass the rebalance fails, as shown below:

      Rebalance exited with reason {prepare_rebalance_failed,
      {error,
      {failed_nodes,
      [{'ns_1@172.23.106.233',{error,timeout}}]}}}.
      Rebalance Operation Id = b2c5b4fea1b0b74c08a0789f1f73d073
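
When the test is rerun many times, it can help to pull the failing node and error out of the rebalance exit reason automatically. A rough Python sketch, assuming the reason is always printed in the shape shown above; the regular expressions and the names `failure_kind` and `failed_nodes` are illustrative and do not amount to a general Erlang term parser:

```python
import re

# Failure text copied from the rebalance error above.
REASON = """Rebalance exited with reason {prepare_rebalance_failed,
{error,
{failed_nodes,
[{'ns_1@172.23.106.233',{error,timeout}}]}}}."""

# Matches entries such as {'ns_1@host',{error,timeout}}.
NODE_ERROR = re.compile(r"\{'([^']+)',\s*\{error,\s*([a-z_]+)\}\}")
# Matches the first atom after "reason {".
KIND = re.compile(r"reason \{([a-z_]+),")


def _flatten(message):
    """Collapse newlines and repeated spaces so the regexes see one line."""
    return " ".join(message.split())


def failure_kind(message):
    """Return the top-level failure atom, or None if it is not found."""
    match = KIND.search(_flatten(message))
    return match.group(1) if match else None


def failed_nodes(message):
    """Return a list of (otp_node, error_atom) pairs."""
    return NODE_ERROR.findall(_flatten(message))


if __name__ == "__main__":
    print(failure_kind(REASON))
    print(failed_nodes(REASON))
```

For the message in this ticket, the sketch reports `prepare_rebalance_failed` with a single failed node, `ns_1@172.23.106.233`, and the error `timeout`.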
      

      I have run this test about 10 times. Every time, the rebalance failed on the second pass through the kill babysitter, hard failover, delta recovery, and rebalance sequence.

      cbcollect_info attached.

          People

            Balakumaran.Gopal Balakumaran Gopal
            Balakumaran.Gopal Balakumaran Gopal