Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: 7.2.4
- Environment: Operating System: Debian GNU/Linux 11 (bullseye); Couchbase Enterprise Edition 7.2.4-7059
- Triage: Untriaged
- Operating System: Linux x86_64
- Is this a Regression?: Unknown
Description
Steps to reproduce (a REST sketch of the same sequence follows this list):
- Created a 3-node KV cluster: 172.23.216.74, 172.23.216.75, 172.23.216.94
- Created a couchstore bucket named 'default'
- Loaded a few documents into the bucket
- Failed over node 172.23.216.74
- Continued doc loading at this point
- Added the node back using full recovery
- Rebalance failed at this point
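The sequence above maps onto plain Couchbase REST calls. Below is a minimal sketch in Python, assuming Administrator/password credentials (hypothetical) and the node IPs from this ticket; cluster setup, bucket creation, and doc loading are elided since the original run drove them through testrunner (see the command at the end of this ticket):

import requests

AUTH = ("Administrator", "password")   # assumed credentials
ORCH = "http://172.23.216.94:8091"     # orchestrator node from the logs
FAILED = "ns_1@172.23.216.74"          # node that gets failed over

def otp_nodes():
    # otpNode names of every node currently known to the cluster
    pool = requests.get(f"{ORCH}/pools/default", auth=AUTH).json()
    return [n["otpNode"] for n in pool["nodes"]]

# Hard failover of 172.23.216.74
requests.post(f"{ORCH}/controller/failOver",
              auth=AUTH, data={"otpNode": FAILED}).raise_for_status()

# (doc loading continues at this point in the original test)

# Add the node back with full recovery
requests.post(f"{ORCH}/controller/setRecoveryType", auth=AUTH,
              data={"otpNode": FAILED,
                    "recoveryType": "full"}).raise_for_status()

# Rebalance -- the step that failed in this ticket
requests.post(f"{ORCH}/controller/rebalance", auth=AUTH,
              data={"knownNodes": ",".join(otp_nodes()),
                    "ejectedNodes": ""}).raise_for_status()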
Rebalance failed with the following error:
2023-12-15T10:07:49.371Z, ns_orchestrator:0:critical:message(ns_1@172.23.216.94) - Rebalance exited with reason {pre_rebalance_janitor_run_failed,"default", {error,marking_as_warmed_failed, ['ns_1@172.23.216.74']}}. Rebalance Operation Id = d1419e35f8ece25274d6d208aaccd93d
The following CRASH REPORT is observed in ns_server.debug.log:
[ns_server:debug,2023-12-15T10:07:42.331Z,ns_1@172.23.216.94:<0.29690.0>:janitor_agent:query_vbuckets_loop:100] Exception from {query_vbuckets,all,[],[{timeout,60000}]} of "default": 'ns_1@172.23.216.74' {'EXIT',{noproc,{gen_server,call, [{'janitor_agent-default','ns_1@172.23.216.74'}, {query_vbuckets,all,[],[{timeout,60000}]}, infinity]}}}
[ns_server:debug,2023-12-15T10:07:42.332Z,ns_1@172.23.216.94:<0.29690.0>:janitor_agent:query_vbuckets_loop_next_step:111] Waiting for "default" on 'ns_1@172.23.216.74'
[ns_server:warn,2023-12-15T10:07:42.333Z,ns_1@172.23.216.94:capi_doc_replicator-default<0.3180.0>:doc_replicator:loop:108] Remote server node {'capi_ddoc_replication_srv-default','ns_1@172.23.216.74'} process down: noproc
[ns_server:debug,2023-12-15T10:07:42.334Z,ns_1@172.23.216.94:capi_doc_replicator-default<0.3180.0>:doc_replicator:loop:74] Replicating all docs to new nodes: ['ns_1@172.23.216.74']
[rebalance:info,2023-12-15T10:07:43.335Z,ns_1@172.23.216.94:<0.29495.0>:ns_rebalancer:rebalance_membase_bucket:621] Bucket is ready on all nodes
[ns_server:debug,2023-12-15T10:07:43.396Z,ns_1@172.23.216.94:chronicle_kv_log<0.420.0>:chronicle_kv_log:log:59] update (key: {node,'ns_1@172.23.216.74',buckets_with_data}, rev: {<<"47fd1fc65263e82efce05228a90f68e6">>, 284}) [{"default",<<"ae5a16cfbf0f44336ed5ecf35c4aabf4">>}]
[ns_server:info,2023-12-15T10:07:43.593Z,ns_1@172.23.216.94:ns_doctor<0.500.0>:ns_doctor:update_status:309] The following buckets became ready on node 'ns_1@172.23.216.74': ["default"]
[ns_server:info,2023-12-15T10:07:45.019Z,ns_1@172.23.216.94:<0.754.0>:ns_orchestrator:handle_event:497] Skipping janitor in state rebalancing
[ns_server:error,2023-12-15T10:07:49.370Z,ns_1@172.23.216.94:<0.29789.0>:ns_janitor:cleanup_apply_config_body:306] Failed to mark bucket `"default"` as warmed up. BadReplies: [{'ns_1@172.23.216.74',bad_node}]
[ns_server:info,2023-12-15T10:07:49.371Z,ns_1@172.23.216.94:rebalance_agent<0.794.0>:rebalance_agent:handle_down:290] Rebalancer process <0.29495.0> died (reason {pre_rebalance_janitor_run_failed, "default", {error,marking_as_warmed_failed, ['ns_1@172.23.216.74']}}).
[ns_server:debug,2023-12-15T10:07:49.371Z,ns_1@172.23.216.94:leader_activities<0.669.0>:leader_activities:handle_activity_down:457] Activity terminated with reason {shutdown, {async_died, {raised, {exit, {pre_rebalance_janitor_run_failed, "default", {error,marking_as_warmed_failed, ['ns_1@172.23.216.74']}}, [{ns_rebalancer,run_janitor_pre_rebalance,1, [{file,"src/ns_rebalancer.erl"},{line,648}]}, {ns_rebalancer,rebalance_membase_bucket,6, [{file,"src/ns_rebalancer.erl"},{line,627}]}, {lists,foreach_1,2,[{file,"lists.erl"},{line,1442}]}, {ns_rebalancer,rebalance_kv,4, [{file,"src/ns_rebalancer.erl"},{line,573}]}, {ns_rebalancer,rebalance_body,5, [{file,"src/ns_rebalancer.erl"},{line,524}]}, {async,'-async_init/4-fun-1-',3, [{file,"src/async.erl"},{line,191}]}]}}}}.
Activity: {activity,<0.29494.0>,#Ref<0.2206367440.2274885633.7904>,default, <<"94ddb9294f3eb5ab911960bc984c1447">>, [rebalance], majority,[]}
[error_logger:error,2023-12-15T10:07:49.371Z,ns_1@172.23.216.94:<0.29491.0>:ale_error_logger_handler:do_log:101]
=========================CRASH REPORT=========================
  crasher:
    initial call: erlang:apply/2
    pid: <0.29491.0>
    registered_name: []
    exception exit: {pre_rebalance_janitor_run_failed,"default",
                     {error,marking_as_warmed_failed,
                      ['ns_1@172.23.216.74']}}
      in function  ns_rebalancer:run_janitor_pre_rebalance/1 (src/ns_rebalancer.erl, line 648)
      in call from ns_rebalancer:rebalance_membase_bucket/6 (src/ns_rebalancer.erl, line 627)
      in call from lists:foreach_1/2 (lists.erl, line 1442)
      in call from ns_rebalancer:rebalance_kv/4 (src/ns_rebalancer.erl, line 573)
      in call from ns_rebalancer:rebalance_body/5 (src/ns_rebalancer.erl, line 524)
      in call from async:'-async_init/4-fun-1-'/3 (src/async.erl, line 191)
    ancestors: [<0.754.0>,ns_orchestrator_child_sup,ns_orchestrator_sup,
                mb_master_sup,mb_master,leader_registry_sup,
                leader_services_sup,<0.654.0>,ns_server_sup,
                ns_server_nodes_sup,<0.279.0>,ns_server_cluster_sup,
                root_sup,<0.149.0>]
    message_queue_len: 0
    messages: []
    links: [<0.754.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 10958
    stack_size: 28
    reductions: 2212
  neighbours:
[user:error,2023-12-15T10:07:49.371Z,ns_1@172.23.216.94:<0.754.0>:ns_orchestrator:log_rebalance_completion:1435] Rebalance exited with reason {pre_rebalance_janitor_run_failed,"default", {error,marking_as_warmed_failed, ['ns_1@172.23.216.74']}}. Rebalance Operation Id = d1419e35f8ece25274d6d208aaccd93d
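Per the logs, the failure means the pre-rebalance janitor run could not mark "default" as warmed up on the recovered node: the 'janitor_agent-default' process on 'ns_1@172.23.216.74' was not yet reachable (noproc), so the node answered bad_node. A minimal diagnostic sketch, again assuming Administrator/password credentials (hypothetical), that polls per-node bucket status over REST until the recovered node leaves warmup:

import time
import requests

AUTH = ("Administrator", "password")   # assumed credentials
URL = "http://172.23.216.94:8091/pools/default/buckets/default"

for _ in range(30):
    bucket = requests.get(URL, auth=AUTH).json()
    status = {n["hostname"]: n["status"] for n in bucket["nodes"]}
    print(status)   # e.g. {'172.23.216.74:8091': 'warmup', ...}
    if all(s == "healthy" for s in status.values()):
        break
    time.sleep(10)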
I am unable to reproduce the issue with the same test run and setup; it appears to be an intermittent issue, not a regression.
Test command:
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /data/workspace/debian-p0-collections-vset00-00-failover_and_recovery_dgm_7.0_P1/testexec.18709.ini GROUP=P0_failover_and_recovery_dgm,rerun=False,get-cbcollect-info=True,log_level=info,upgrade_version=7.2.4-7059,sirius_url=http://172.23.120.103:4000 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_hard_failover_recovery,nodes_init=3,nodes_failover=1,recovery_type=full,bucket_spec=dgm.buckets_for_rebalance_tests,data_load_stage=during,dgm=40,skip_validations=False,GROUP=P0_failover_and_recovery_dgm'
Job name: debian-collections-failover_and_recovery_dgm_7.0_P1