Couchbase Server
MB-60083

[Rebalance] : Rebalance fails with reason {pre_rebalance_janitor_run_failed,"default",{error,marking_as_warmed_failed,['ns_1@172.23.216.74']}}.


Details

    Description

      Steps to reproduce

      1. Create a 3-node KV cluster: 172.23.216.74, 172.23.216.75, 172.23.216.94.
      2. Create a couchstore bucket named 'default'.
      3. Load a few documents into the bucket.
      4. Fail over node 172.23.216.74.
      5. Continue doc loading at this point.
      6. Add the node back using full recovery.
      7. Rebalance fails at this point (a REST-level sketch of steps 4-7 follows this list).
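
      For reference, steps 4-7 can also be driven directly against the ns_server REST API. This is a minimal sketch, assuming default admin credentials and port 8091 with the node names above; it is not the TAF test itself (that is linked below).

          import requests

          AUTH = ("Administrator", "password")   # assumption: admin credentials
          ORCH = "http://172.23.216.94:8091"     # any surviving node can drive the steps
          FAILED_OTP = "ns_1@172.23.216.74"

          # Step 4: hard failover of the node.
          requests.post(f"{ORCH}/controller/failOver",
                        auth=AUTH, data={"otpNode": FAILED_OTP}).raise_for_status()

          # Step 6: add the node back with full recovery.
          requests.post(f"{ORCH}/controller/setRecoveryType", auth=AUTH,
                        data={"otpNode": FAILED_OTP,
                              "recoveryType": "full"}).raise_for_status()

          # Step 7: rebalance; knownNodes lists every node, nothing is ejected.
          known = ",".join(["ns_1@172.23.216.74", "ns_1@172.23.216.75",
                            "ns_1@172.23.216.94"])
          requests.post(f"{ORCH}/controller/rebalance", auth=AUTH,
                        data={"knownNodes": known,
                              "ejectedNodes": ""}).raise_for_status()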

      Rebalance failed with the following error

      2023-12-15T10:07:49.371Z, ns_orchestrator:0:critical:message(ns_1@172.23.216.94) - Rebalance exited with reason
          {pre_rebalance_janitor_run_failed,"default",
              {error,marking_as_warmed_failed,
                  ['ns_1@172.23.216.74']}}.
      Rebalance Operation Id = d1419e35f8ece25274d6d208aaccd93d

      The following CRASH REPORT appears in ns_server.debug.log. The initial noproc exception indicates that the bucket's 'janitor_agent-default' process on 'ns_1@172.23.216.74' was not yet running when it was queried; the janitor's subsequent attempt to mark the bucket as warmed then fails with bad_node:

      [ns_server:debug,2023-12-15T10:07:42.331Z,ns_1@172.23.216.94:<0.29690.0>:janitor_agent:query_vbuckets_loop:100]
      Exception from {query_vbuckets,all,[],[{timeout,60000}]} of "default":'ns_1@172.23.216.74'
      {'EXIT',{noproc,{gen_server,call,
                       [{'janitor_agent-default','ns_1@172.23.216.74'},
                        {query_vbuckets,all,[],[{timeout,60000}]},
                        infinity]}}}

      [ns_server:debug,2023-12-15T10:07:42.332Z,ns_1@172.23.216.94:<0.29690.0>:janitor_agent:query_vbuckets_loop_next_step:111]
      Waiting for "default" on 'ns_1@172.23.216.74'

      [ns_server:warn,2023-12-15T10:07:42.333Z,ns_1@172.23.216.94:capi_doc_replicator-default<0.3180.0>:doc_replicator:loop:108]
      Remote server node {'capi_ddoc_replication_srv-default','ns_1@172.23.216.74'} process down: noproc

      [ns_server:debug,2023-12-15T10:07:42.334Z,ns_1@172.23.216.94:capi_doc_replicator-default<0.3180.0>:doc_replicator:loop:74]
      Replicating all docs to new nodes: ['ns_1@172.23.216.74']

      [rebalance:info,2023-12-15T10:07:43.335Z,ns_1@172.23.216.94:<0.29495.0>:ns_rebalancer:rebalance_membase_bucket:621]
      Bucket is ready on all nodes

      [ns_server:debug,2023-12-15T10:07:43.396Z,ns_1@172.23.216.94:chronicle_kv_log<0.420.0>:chronicle_kv_log:log:59]
      update (key: {node,'ns_1@172.23.216.74',buckets_with_data}, rev: {<<"47fd1fc65263e82efce05228a90f68e6">>,284})
      [{"default",<<"ae5a16cfbf0f44336ed5ecf35c4aabf4">>}]

      [ns_server:info,2023-12-15T10:07:43.593Z,ns_1@172.23.216.94:ns_doctor<0.500.0>:ns_doctor:update_status:309]
      The following buckets became ready on node 'ns_1@172.23.216.74': ["default"]

      [ns_server:info,2023-12-15T10:07:45.019Z,ns_1@172.23.216.94:<0.754.0>:ns_orchestrator:handle_event:497]
      Skipping janitor in state rebalancing

      [ns_server:error,2023-12-15T10:07:49.370Z,ns_1@172.23.216.94:<0.29789.0>:ns_janitor:cleanup_apply_config_body:306]
      Failed to mark bucket `"default"` as warmed up. BadReplies: [{'ns_1@172.23.216.74',bad_node}]

      [ns_server:info,2023-12-15T10:07:49.371Z,ns_1@172.23.216.94:rebalance_agent<0.794.0>:rebalance_agent:handle_down:290]
      Rebalancer process <0.29495.0> died (reason {pre_rebalance_janitor_run_failed,"default",
                                                   {error,marking_as_warmed_failed,
                                                    ['ns_1@172.23.216.74']}}).

      [ns_server:debug,2023-12-15T10:07:49.371Z,ns_1@172.23.216.94:leader_activities<0.669.0>:leader_activities:handle_activity_down:457]
      Activity terminated with reason {shutdown,
                                       {async_died,
                                        {raised,
                                         {exit,
                                          {pre_rebalance_janitor_run_failed,"default",
                                           {error,marking_as_warmed_failed,
                                            ['ns_1@172.23.216.74']}},
                                          [{ns_rebalancer,run_janitor_pre_rebalance,1,
                                            [{file,"src/ns_rebalancer.erl"},{line,648}]},
                                           {ns_rebalancer,rebalance_membase_bucket,6,
                                            [{file,"src/ns_rebalancer.erl"},{line,627}]},
                                           {lists,foreach_1,2,
                                            [{file,"lists.erl"},{line,1442}]},
                                           {ns_rebalancer,rebalance_kv,4,
                                            [{file,"src/ns_rebalancer.erl"},{line,573}]},
                                           {ns_rebalancer,rebalance_body,5,
                                            [{file,"src/ns_rebalancer.erl"},{line,524}]},
                                           {async,'-async_init/4-fun-1-',3,
                                            [{file,"src/async.erl"},{line,191}]}]}}}}.
      Activity: {activity,<0.29494.0>,#Ref<0.2206367440.2274885633.7904>,default,
                 <<"94ddb9294f3eb5ab911960bc984c1447">>,
                 [rebalance],majority,[]}

      [error_logger:error,2023-12-15T10:07:49.371Z,ns_1@172.23.216.94:<0.29491.0>:ale_error_logger_handler:do_log:101]
      ========================= CRASH REPORT =========================
        crasher:
          initial call: erlang:apply/2
          pid: <0.29491.0>
          registered_name: []
          exception exit: {pre_rebalance_janitor_run_failed,"default",
                           {error,marking_as_warmed_failed,
                            ['ns_1@172.23.216.74']}}
            in function  ns_rebalancer:run_janitor_pre_rebalance/1 (src/ns_rebalancer.erl, line 648)
            in call from ns_rebalancer:rebalance_membase_bucket/6 (src/ns_rebalancer.erl, line 627)
            in call from lists:foreach_1/2 (lists.erl, line 1442)
            in call from ns_rebalancer:rebalance_kv/4 (src/ns_rebalancer.erl, line 573)
            in call from ns_rebalancer:rebalance_body/5 (src/ns_rebalancer.erl, line 524)
            in call from async:'-async_init/4-fun-1-'/3 (src/async.erl, line 191)
          ancestors: [<0.754.0>,ns_orchestrator_child_sup,ns_orchestrator_sup,
                      mb_master_sup,mb_master,leader_registry_sup,
                      leader_services_sup,<0.654.0>,ns_server_sup,
                      ns_server_nodes_sup,<0.279.0>,ns_server_cluster_sup,
                      root_sup,<0.149.0>]
          message_queue_len: 0
          messages: []
          links: [<0.754.0>]
          dictionary: []
          trap_exit: false
          status: running
          heap_size: 10958
          stack_size: 28
          reductions: 2212
        neighbours:

      [user:error,2023-12-15T10:07:49.371Z,ns_1@172.23.216.94:<0.754.0>:ns_orchestrator:log_rebalance_completion:1435]
      Rebalance exited with reason {pre_rebalance_janitor_run_failed,"default",
                                    {error,marking_as_warmed_failed,
                                     ['ns_1@172.23.216.74']}}.
      Rebalance Operation Id = d1419e35f8ece25274d6d208aaccd93d
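
      When triaging similar runs, the relevant records can be pulled out of a collected ns_server.debug.log mechanically. A minimal sketch; the file path and marker strings are assumptions based on the entries above:

          import re

          # Strings that identify the failure in this ticket's logs.
          MARKERS = ("marking_as_warmed_failed",
                     "pre_rebalance_janitor_run_failed",
                     "Failed to mark bucket")

          with open("ns_server.debug.log") as f:
              log = f.read()

          # Each record begins with a header like
          # [ns_server:debug,2023-12-15T10:07:42.331Z,...], so split on that.
          records = re.split(r"(?=\[\w+:\w+,\d{4}-\d{2}-\d{2}T)", log)
          for rec in records:
              if any(m in rec for m in MARKERS):
                  print(rec.strip(), end="\n\n")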

      I am unable to reproduce the issue with the same test run and setup; it appears to be intermittent. It is not a regression.
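
      Since the failure is intermittent, reruns are easier to triage if the test polls the standard /pools/default/tasks endpoint and surfaces the rebalance error directly instead of waiting for the harness to time out. A sketch, assuming the same credentials as in the earlier snippet; the exact fields returned for a failed rebalance may vary by version:

          import time
          import requests

          AUTH = ("Administrator", "password")   # assumption: admin credentials
          ORCH = "http://172.23.216.94:8091"

          def wait_for_rebalance(timeout_s=600, poll_s=5):
              """Poll the task list until the rebalance task stops running."""
              deadline = time.time() + timeout_s
              while time.time() < deadline:
                  tasks = requests.get(f"{ORCH}/pools/default/tasks", auth=AUTH).json()
                  reb = next((t for t in tasks if t.get("type") == "rebalance"), None)
                  if reb is None or reb.get("status") == "notRunning":
                      err = reb.get("errorMessage") if reb else None
                      if err:
                          # e.g. the pre_rebalance_janitor_run_failed failure above
                          raise RuntimeError(err)
                      return
                  time.sleep(poll_s)
              raise TimeoutError("rebalance did not finish within the timeout")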



      TAF Script to reproduce

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /data/workspace/debian-p0-collections-vset00-00-failover_and_recovery_dgm_7.0_P1/testexec.18709.ini GROUP=P0_failover_and_recovery_dgm,rerun=False,get-cbcollect-info=True,log_level=info,upgrade_version=7.2.4-7059,sirius_url=http://172.23.120.103:4000 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_hard_failover_recovery,nodes_init=3,nodes_failover=1,recovery_type=full,bucket_spec=dgm.buckets_for_rebalance_tests,data_load_stage=during,dgm=40,skip_validations=False,GROUP=P0_failover_and_recovery_dgm'

      Job name : debian-collections-failover_and_recovery_dgm_7.0_P1

      Job ref : http://cb-logs-qe.s3-website-us-west-2.amazonaws.com/7.2.4-7059/jenkins_logs/test_suite_executor-TAF/294985/

      Attachments


        Activity

          People

            Assignee: Raghav S K (raghav.sk)
            Reporter: Raghav S K (raghav.sk)
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes
