Couchbase Server / MB-57874

Tombstone purging triggered cancellation of a DCP backfill during rebalance. Was: [System Test] :- Rebalance in of KV nodes fails with "Rebalance exited with reason bad_replicas."

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • 7.2.5
    • 7.2.1
    • sdkdev
    • Enterprise Edition 7.2.1 build 5849
    • Untriaged
    • Centos 64-bit
    • 0
    • No

    Description

      Script to Repro

      ./sequoia -client 172.23.104.27:2375 -provider file:centos_pine.yml -test tests/integration/7.2/test_7.2.yml -scope tests/integration/7.2/scope_7.2_magma.yml -scale 3 -repeat 0 -log_level 0 -version 7.2.1-5849 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      The test ran without problems for 6 days. On day 6, a rebalance-in of KV nodes failed.

      [2023-07-16T15:08:59-07:00, sequoiatools/couchbase-cli:7.1:35297f] server-add -c 172.23.108.103:8091 --server-add https://172.23.99.20 -u Administrator -p password --server-add-username Administrator --server-add-password password --services data
      [2023-07-16T15:09:38-07:00, sequoiatools/couchbase-cli:7.1:e896ce] rebalance -c 172.23.108.103:8091 -u Administrator -p password
       
      Error occurred on container - sequoiatools/couchbase-cli:7.1:[rebalance -c 172.23.108.103:8091 -u Administrator -p password]
       
      docker logs e896ce
      docker start e896ce
       
      Unable to display progress bar on this os
      ERROR: Rebalance failed. See logs for detailed reason. You can try again.
      

      172.23.108.103 : rebalance

      [user:error,2023-07-16T18:29:21.931-07:00,ns_1@172.23.108.103<0.25113.0>:ns_orchestrator:log_rebalance_completion:1433]Rebalance exited with reason bad_replicas.
      

      From debug.log of 172.23.108.103

      [user:info,2023-07-16T14:52:20.377-07:00,ns_1@172.23.108.103:<0.31907.2404>:ns_rebalancer:verify_replication:849]Bad replicators after rebalance:
      Missing = [{'ns_1@172.23.106.100','ns_1@172.23.99.25',69},
                 {'ns_1@172.23.106.100','ns_1@172.23.99.25',186},
                 {'ns_1@172.23.106.100','ns_1@172.23.99.25',258},
                 {'ns_1@172.23.106.100','ns_1@172.23.99.25',464},
                 {'ns_1@172.23.106.100','ns_1@172.23.99.25',550},
                 {'ns_1@172.23.106.100','ns_1@172.23.99.25',790},
                 {'ns_1@172.23.108.103','ns_1@172.23.99.25',88},
                 {'ns_1@172.23.108.103','ns_1@172.23.99.25',89},
                 {'ns_1@172.23.108.103','ns_1@172.23.99.25',176},
                 {'ns_1@172.23.108.103','ns_1@172.23.99.25',177},
                 {'ns_1@172.23.108.103','ns_1@172.23.99.25',181},
                 {'ns_1@172.23.108.103','ns_1@172.23.99.25',182},
                 {'ns_1@172.23.108.103','ns_1@172.23.99.25',185},
                 {'ns_1@172.23.108.103','ns_1@172.23.99.25',202},
                 {'ns_1@172.23.121.117','ns_1@172.23.99.25',288},
                 {'ns_1@172.23.121.117','ns_1@172.23.99.25',349},
                 {'ns_1@172.23.121.117','ns_1@172.23.99.25',364},
                 {'ns_1@172.23.121.117','ns_1@172.23.99.25',365},
                 {'ns_1@172.23.121.117','ns_1@172.23.99.25',366},
                 {'ns_1@172.23.121.117','ns_1@172.23.99.25',368},
                 {'ns_1@172.23.121.117','ns_1@172.23.99.25',371},
                 {'ns_1@172.23.121.117','ns_1@172.23.99.25',777},
                 {'ns_1@172.23.97.121','ns_1@172.23.99.25',549},
                 {'ns_1@172.23.97.121','ns_1@172.23.99.25',551},
                 {'ns_1@172.23.97.121','ns_1@172.23.99.25',552},
                 {'ns_1@172.23.97.121','ns_1@172.23.99.25',553},
                 {'ns_1@172.23.97.121','ns_1@172.23.99.25',554},
                 {'ns_1@172.23.97.121','ns_1@172.23.99.25',555},
                 {'ns_1@172.23.97.121','ns_1@172.23.99.25',556},
                 {'ns_1@172.23.97.121','ns_1@172.23.99.25',557},
                 {'ns_1@172.23.97.121','ns_1@172.23.99.25',558},
                 {'ns_1@172.23.97.121','ns_1@172.23.99.25',750},
                 {'ns_1@172.23.97.122','ns_1@172.23.99.25',626},
                 {'ns_1@172.23.97.122','ns_1@172.23.99.25',644},
                 {'ns_1@172.23.97.122','ns_1@172.23.99.25',648},
                 {'ns_1@172.23.97.122','ns_1@172.23.99.25',651},
                 {'ns_1@172.23.99.21','ns_1@172.23.99.25',842},
                 {'ns_1@172.23.99.21','ns_1@172.23.99.25',921},
                 {'ns_1@172.23.99.21','ns_1@172.23.99.25',923},
                 {'ns_1@172.23.99.21','ns_1@172.23.99.25',924},
                 {'ns_1@172.23.99.21','ns_1@172.23.99.25',925},
                 {'ns_1@172.23.99.21','ns_1@172.23.99.25',926},
                 {'ns_1@172.23.99.21','ns_1@172.23.99.25',927},
                 {'ns_1@172.23.99.21','ns_1@172.23.99.25',928}]
      Extras = []
      [ns_server:info,2023-07-16T14:52:20.379-07:00,ns_1@172.23.108.103:rebalance_agent<0.23399.0>:rebalance_agent:handle_down:290]Rebalancer process <0.31907.2404> died (reason bad_replicas).
      [ns_server:debug,2023-07-16T14:52:20.380-07:00,ns_1@172.23.108.103:leader_activities<0.25076.0>:leader_activities:handle_activity_down:450]Activity terminated with reason {shutdown,
                                       {async_died,
                                        {raised,
                                         {exit,bad_replicas,
                                          [{ns_rebalancer,verify_replication,3,
                                            [{file,"src/ns_rebalancer.erl"},
                                             {line,852}]},
                                           {lists,foreach,2,
                                            [{file,"lists.erl"},{line,1342}]},
                                           {ns_rebalancer,rebalance_kv,4,
                                            [{file,"src/ns_rebalancer.erl"},
                                             {line,573}]},
                                           {ns_rebalancer,rebalance_body,5,
                                            [{file,"src/ns_rebalancer.erl"},
                                             {line,524}]},
                                           {async,'-async_init/4-fun-1-',3,
                                            [{file,"src/async.erl"},
                                             {line,191}]}]}}}}. Activity:
      {activity,<0.32525.2404>,#Ref<0.3410623904.2699821063.174559>,default,
                <<"bc6150dd5e92a7291c7d716fa589547a">>,
                [rebalance],
                majority,[]}
      [error_logger:error,2023-07-16T14:52:20.380-07:00,ns_1@172.23.108.103:<0.26414.2404>:ale_error_logger_handler:do_log:101]
      =========================CRASH REPORT=========================
        crasher:
          initial call: erlang:apply/2
          pid: <0.26414.2404>
          registered_name: []
          exception exit: bad_replicas
            in function  ns_rebalancer:verify_replication/3 (src/ns_rebalancer.erl, line 852)
            in call from lists:foreach/2 (lists.erl, line 1342)
            in call from ns_rebalancer:rebalance_kv/4 (src/ns_rebalancer.erl, line 573)
            in call from ns_rebalancer:rebalance_body/5 (src/ns_rebalancer.erl, line 524)
            in call from async:'-async_init/4-fun-1-'/3 (src/async.erl, line 191)
          ancestors: [<0.25113.0>,ns_orchestrator_child_sup,ns_orchestrator_sup,
                        mb_master_sup,mb_master,leader_registry_sup,
                        leader_services_sup,<0.23335.0>,ns_server_sup,
                        ns_server_nodes_sup,<0.269.0>,ns_server_cluster_sup,
                        root_sup,<0.145.0>]
          message_queue_len: 0
          messages: []
          links: [<0.25113.0>]
          dictionary: []
          trap_exit: false
          status: running
          heap_size: 121536
          stack_size: 29
          reductions: 12697
        neighbours:
       
      [user:error,2023-07-16T14:52:20.388-07:00,ns_1@172.23.108.103:<0.25113.0>:ns_orchestrator:log_rebalance_completion:1433]Rebalance exited with reason bad_replicas.
      Rebalance Operation Id = f1175217dfc503ff2f64e14420629045
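
      The `Missing` list above is easier to read when grouped by node pair. The following is a minimal Python sketch, not part of the original report, that assumes each tuple has the form {source node, destination node, vBucket id}; that reading is an assumption, although it is consistent with every entry in the excerpt naming ns_1@172.23.99.25 as its second element. The helper name `group_missing` and the regular expression are hypothetical and only illustrate how the excerpt could be summarised.

```python
import re
from collections import defaultdict

# Matches one entry such as {'ns_1@172.23.106.100','ns_1@172.23.99.25',69}
# Assumption: the fields are (source node, destination node, vBucket id).
ENTRY = re.compile(r"\{'([^']+)','([^']+)',(\d+)\}")


def group_missing(log_text):
    """Map (source node, destination node) -> sorted list of vBucket ids."""
    groups = defaultdict(list)
    for src, dst, vb in ENTRY.findall(log_text):
        groups[(src, dst)].append(int(vb))
    return {pair: sorted(vbs) for pair, vbs in groups.items()}


# A short sample copied from the debug.log excerpt in this report.
sample = """
Missing = [{'ns_1@172.23.106.100','ns_1@172.23.99.25',69},
           {'ns_1@172.23.106.100','ns_1@172.23.99.25',186},
           {'ns_1@172.23.108.103','ns_1@172.23.99.25',88}]
Extras = []
"""

summary = group_missing(sample)
for (src, dst), vbs in sorted(summary.items()):
    print(f"{src} -> {dst}: {len(vbs)} vBucket(s) {vbs}")
```

      Running the same function over the full `Missing` list from debug.log would give a per-node count of the replication streams that were absent when `verify_replication` ran.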
      

      We have not yet had a clean run to establish a baseline, so this is marked as not a regression.
      cbcollect_info is attached.


            People

              pulkit.matta Pulkit Matta
              Balakumaran.Gopal Balakumaran Gopal