Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: 7.2.4
- Environment: Operating System: Debian GNU/Linux 11 (bullseye); Couchbase Enterprise Edition 7.2.4-7059
- Triage: Untriaged
- Operating System: Linux x86_64
- Is this a Regression?: Unknown
Description
Steps to reproduce (a REST sketch of the same sequence follows this list):
- Created a 3-node KV cluster: 172.23.216.74, 172.23.216.75, 172.23.216.94
- Created a couchstore bucket named 'default'
- Loaded a few documents into the bucket
- Failed over node 172.23.216.74
- Continued doc loading at this point
- Added the node back using full recovery
- Rebalance failed at this point
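The sequence above maps onto plain Couchbase REST calls. Below is a minimal sketch in Python, assuming Administrator/password credentials (hypothetical) and the node IPs from this ticket; cluster setup, bucket creation, and doc loading are elided since the original run drove them through testrunner (see the command at the end of this ticket):

import requests

AUTH = ("Administrator", "password")   # assumed credentials
ORCH = "http://172.23.216.94:8091"     # orchestrator node from the logs
FAILED = "ns_1@172.23.216.74"          # node that gets failed over

def otp_nodes():
    # otpNode names of every node currently known to the cluster
    pool = requests.get(f"{ORCH}/pools/default", auth=AUTH).json()
    return [n["otpNode"] for n in pool["nodes"]]

# Hard failover of 172.23.216.74
requests.post(f"{ORCH}/controller/failOver",
              auth=AUTH, data={"otpNode": FAILED}).raise_for_status()

# (doc loading continues at this point in the original test)

# Add the node back with full recovery
requests.post(f"{ORCH}/controller/setRecoveryType", auth=AUTH,
              data={"otpNode": FAILED,
                    "recoveryType": "full"}).raise_for_status()

# Rebalance -- the step that failed in this ticket
requests.post(f"{ORCH}/controller/rebalance", auth=AUTH,
              data={"knownNodes": ",".join(otp_nodes()),
                    "ejectedNodes": ""}).raise_for_status()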
Rebalance failed with the following error:
2023-12-15T10:07:49.371Z, ns_orchestrator:0:critical:message(ns_1@172.23.216.94) - Rebalance exited with reason {pre_rebalance_janitor_run_failed,"default", {error,marking_as_warmed_failed, ['ns_1@172.23.216.74']}}. Rebalance Operation Id = d1419e35f8ece25274d6d208aaccd93d
The following CRASH REPORT is observed in ns_server.debug.log:
[ns_server:debug,2023-12-15T10:07:42.331Z,ns_1@172.23.216.94:<0.29690.0>:janitor_agent:query_vbuckets_loop:100] Exception from {query_vbuckets,all,[],[{timeout,60000}]} of "default": 'ns_1@172.23.216.74' {'EXIT',{noproc,{gen_server,call, [{'janitor_agent-default','ns_1@172.23.216.74'}, {query_vbuckets,all,[],[{timeout,60000}]}, infinity]}}}
[ns_server:debug,2023-12-15T10:07:42.332Z,ns_1@172.23.216.94:<0.29690.0>:janitor_agent:query_vbuckets_loop_next_step:111] Waiting for "default" on 'ns_1@172.23.216.74'
[ns_server:warn,2023-12-15T10:07:42.333Z,ns_1@172.23.216.94:capi_doc_replicator-default<0.3180.0>:doc_replicator:loop:108] Remote server node {'capi_ddoc_replication_srv-default','ns_1@172.23.216.74'} process down: noproc
[ns_server:debug,2023-12-15T10:07:42.334Z,ns_1@172.23.216.94:capi_doc_replicator-default<0.3180.0>:doc_replicator:loop:74] Replicating all docs to new nodes: ['ns_1@172.23.216.74']
[rebalance:info,2023-12-15T10:07:43.335Z,ns_1@172.23.216.94:<0.29495.0>:ns_rebalancer:rebalance_membase_bucket:621] Bucket is ready on all nodes
[ns_server:debug,2023-12-15T10:07:43.396Z,ns_1@172.23.216.94:chronicle_kv_log<0.420.0>:chronicle_kv_log:log:59] update (key: {node,'ns_1@172.23.216.74',buckets_with_data}, rev: {<<"47fd1fc65263e82efce05228a90f68e6">>, 284}) [{"default",<<"ae5a16cfbf0f44336ed5ecf35c4aabf4">>}]
[ns_server:info,2023-12-15T10:07:43.593Z,ns_1@172.23.216.94:ns_doctor<0.500.0>:ns_doctor:update_status:309] The following buckets became ready on node 'ns_1@172.23.216.74': ["default"]
[ns_server:info,2023-12-15T10:07:45.019Z,ns_1@172.23.216.94:<0.754.0>:ns_orchestrator:handle_event:497] Skipping janitor in state rebalancing
[ns_server:error,2023-12-15T10:07:49.370Z,ns_1@172.23.216.94:<0.29789.0>:ns_janitor:cleanup_apply_config_body:306] Failed to mark bucket `"default"` as warmed up. BadReplies: [{'ns_1@172.23.216.74',bad_node}]
[ns_server:info,2023-12-15T10:07:49.371Z,ns_1@172.23.216.94:rebalance_agent<0.794.0>:rebalance_agent:handle_down:290] Rebalancer process <0.29495.0> died (reason {pre_rebalance_janitor_run_failed, "default", {error,marking_as_warmed_failed, ['ns_1@172.23.216.74']}}).
[ns_server:debug,2023-12-15T10:07:49.371Z,ns_1@172.23.216.94:leader_activities<0.669.0>:leader_activities:handle_activity_down:457] Activity terminated with reason {shutdown, {async_died, {raised, {exit, {pre_rebalance_janitor_run_failed, "default", {error,marking_as_warmed_failed, ['ns_1@172.23.216.74']}}, [{ns_rebalancer,run_janitor_pre_rebalance,1, [{file,"src/ns_rebalancer.erl"},{line,648}]}, {ns_rebalancer,rebalance_membase_bucket,6, [{file,"src/ns_rebalancer.erl"},{line,627}]}, {lists,foreach_1,2,[{file,"lists.erl"},{line,1442}]}, {ns_rebalancer,rebalance_kv,4, [{file,"src/ns_rebalancer.erl"},{line,573}]}, {ns_rebalancer,rebalance_body,5, [{file,"src/ns_rebalancer.erl"},{line,524}]}, {async,'-async_init/4-fun-1-',3, [{file,"src/async.erl"},{line,191}]}]}}}}.
Activity: {activity,<0.29494.0>,#Ref<0.2206367440.2274885633.7904>,default, <<"94ddb9294f3eb5ab911960bc984c1447">>, [rebalance], majority,[]}
[error_logger:error,2023-12-15T10:07:49.371Z,ns_1@172.23.216.94:<0.29491.0>:ale_error_logger_handler:do_log:101]
=========================CRASH REPORT=========================
  crasher:
    initial call: erlang:apply/2
    pid: <0.29491.0>
    registered_name: []
    exception exit: {pre_rebalance_janitor_run_failed,"default",
                     {error,marking_as_warmed_failed,
                      ['ns_1@172.23.216.74']}}
      in function  ns_rebalancer:run_janitor_pre_rebalance/1 (src/ns_rebalancer.erl, line 648)
      in call from ns_rebalancer:rebalance_membase_bucket/6 (src/ns_rebalancer.erl, line 627)
      in call from lists:foreach_1/2 (lists.erl, line 1442)
      in call from ns_rebalancer:rebalance_kv/4 (src/ns_rebalancer.erl, line 573)
      in call from ns_rebalancer:rebalance_body/5 (src/ns_rebalancer.erl, line 524)
      in call from async:'-async_init/4-fun-1-'/3 (src/async.erl, line 191)
    ancestors: [<0.754.0>,ns_orchestrator_child_sup,ns_orchestrator_sup,
                mb_master_sup,mb_master,leader_registry_sup,
                leader_services_sup,<0.654.0>,ns_server_sup,
                ns_server_nodes_sup,<0.279.0>,ns_server_cluster_sup,
                root_sup,<0.149.0>]
    message_queue_len: 0
    messages: []
    links: [<0.754.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 10958
    stack_size: 28
    reductions: 2212
  neighbours:
[user:error,2023-12-15T10:07:49.371Z,ns_1@172.23.216.94:<0.754.0>:ns_orchestrator:log_rebalance_completion:1435] Rebalance exited with reason {pre_rebalance_janitor_run_failed,"default", {error,marking_as_warmed_failed, ['ns_1@172.23.216.74']}}. Rebalance Operation Id = d1419e35f8ece25274d6d208aaccd93d
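Per the logs, the failure means the pre-rebalance janitor run could not mark "default" as warmed up on the recovered node: the 'janitor_agent-default' process on 'ns_1@172.23.216.74' was not yet reachable (noproc), so the node answered bad_node. A minimal diagnostic sketch, again assuming Administrator/password credentials (hypothetical), that polls per-node bucket status over REST until the recovered node leaves warmup:

import time
import requests

AUTH = ("Administrator", "password")   # assumed credentials
URL = "http://172.23.216.94:8091/pools/default/buckets/default"

for _ in range(30):
    bucket = requests.get(URL, auth=AUTH).json()
    status = {n["hostname"]: n["status"] for n in bucket["nodes"]}
    print(status)   # e.g. {'172.23.216.74:8091': 'warmup', ...}
    if all(s == "healthy" for s in status.values()):
        break
    time.sleep(10)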
I am unable to reproduce the issue with the same test run and setup; it appears to be an intermittent issue, not a regression.
Test command:
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /data/workspace/debian-p0-collections-vset00-00-failover_and_recovery_dgm_7.0_P1/testexec.18709.ini GROUP=P0_failover_and_recovery_dgm,rerun=False,get-cbcollect-info=True,log_level=info,upgrade_version=7.2.4-7059,sirius_url=http://172.23.120.103:4000 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_hard_failover_recovery,nodes_init=3,nodes_failover=1,recovery_type=full,bucket_spec=dgm.buckets_for_rebalance_tests,data_load_stage=during,dgm=40,skip_validations=False,GROUP=P0_failover_and_recovery_dgm'
Job name: debian-collections-failover_and_recovery_dgm_7.0_P1