Details
-
Bug
-
Resolution: Not a Bug
-
Major
-
7.6.0
-
Operating System : Debian GNU/Linux 12 (bookworm)
Couchbase Enterprise Edition build 7.6.0-1822
-
Untriaged
-
-
0
-
Unknown
Description
Steps to reproduce
- Created a 2 node kv cluster 172.23.96.168 and 172.23.121.71
- Created an ephemeral bucket named default
- Loaded 213805 items onto the bucket
- Added node 172.23.96.196
- Started a swap rebalance by ejecting node 172.23.96.168
- Parallely started a create syncwrite workload with durability MAJORITY which continued through the rebalance
Rebalance fails with the following error
2023-11-20T00:30:17.528-08:00, ns_orchestrator:0:critical:message(ns_1@172.23.121.71) - Rebalance exited with reason {mover_crashed, {{{{{badmatch, [{<0.29601.0>, {done,exit, {socket_closed, {gen_server,call, [<0.19270.0>, {setup_streams, [512,513,514,515,516,517,518,519, 520,521,522,523,524,525,526,527, 528,529,530,531,532,533,534,535, 536,537,538,539,540,541,542,543, 544,545,546,547,548,549,550,551, 552,553,554,555,556,557,558,559, 560,561,562,563,564,565,566,567, 568,569,570,571,572,573,574,575, 576,577,578,579,580,581,582,583, 584,585,586,587,588,589,590,591, 592,593,594,595,596,597,598,599, 600,601,602,603,604,605,606,607, 608,609,610,611,612,613,614,615, 616,617,618,619,620,621,622,623, 624,625,626,627,628,629,630,631, 632,633,634,635,636,637,638,639, 640,641,642,643,644,645,646,647, 648,649,650,651,652,653,654,655, 656,657,658,659,660,661,662,663, 664,665,666,667,668,669,670,671, 672,673,674,675,676,677,678,679, 680,681,682,683,684,685,686,687, 688,689,690,691,692,693,694,695, 696,697,698,699,700,701,702,703, 704,705,706,707,708,709,710,711, 712,713,714,715,716,717,718,719, 720,721,722,723,724,725,726,727, 728,729,730,731,732,733,734,735, 736,737,738,739,740,741,742,743, 744,745,746,747,748,749,750,751, 752,753,754,755,756,757,758,759, 760,761,762,763,764,765,766,767, 768,769,770,771,772,773,774,775, 776,777,778,779,780,781,782,783, 784,785,786,787,788,789,790,791, 792,793,794,795,796,797,798,799, 800,801,802,803,804,805,806,807, 808,809,810,811,812,813,814,815, 816,817,818,819,820,821,822,823, 824,825,826,827,828,829,830,831, 832,833,834,835,836,837,838,839, 840,841,842,843,844,845,846,847, 848,849,850,851,852,853,854,855, 856,857,858,859,860,861,862,863, 864,865,866,867,868,869,870,871, 872,873,874,875,876,877,878,879, 880,881,882,883,884,885,886,887, 888,889,890,891,892,893,894,895, 896,897,898,899,900,901,902,903, 904,905,906,907,908,909,910,911, 912,913,914,915,916,917,918,919, 920,921,922,923,924,925,926,927, 928,929,930,931,932,933,934,935, 936,937,938,939,940,941,942,943, 944,945,946,947,948,949,950,951, 952,953,954,955,956,957,958,959, 960,961,962,963,964,965,966,967, 968,969,970,971,972,973,974,975, 976,977,978,979,980,981,982,983, 984,985,986,987,988,989,990,991, 992,993,994,995,996,997,998,999, 1000,1001,1002,1003,1004,1005,1006, 1007,1008,1009,1010,1011,1012,1013, 1014,1015,1016,1017,1018,1019,1020, 1022]}, infinity]}}, [{gen_server,call,3, [{file,"gen_server.erl"},{line,385}]}, {dcp_replicator, '-spawn_and_wait/1-fun-0-',1, [{file,"src/dcp_replicator.erl"}, {line,313}]}]}}]}, [{misc, sync_shutdown_many_i_am_trapping_exits,1, [{file,"src/misc.erl"},{line,1517}]}, {dcp_replicator,spawn_and_wait,1, [{file,"src/dcp_replicator.erl"}, {line,334}]}, {dcp_replicator,handle_call,3, [{file,"src/dcp_replicator.erl"}, {line,146}]}, {gen_server,try_handle_call,4, [{file,"gen_server.erl"},{line,1149}]}, {gen_server,handle_msg,6, [{file,"gen_server.erl"},{line,1178}]}, {proc_lib,init_p_do_apply,3, [{file,"proc_lib.erl"},{line,240}]}]}, {gen_server,call, [<0.19269.0>,get_partitions,infinity]}}, {gen_server,call, ['dcp_replication_manager-default', {get_replicator_pid,1006}, infinity]}}, {gen_server,call, [{'janitor_agent-default', 'ns_1@172.23.121.71'}, {if_rebalance,<0.25989.0>, {get_vbucket_high_seqno,428}}, infinity]}}}.Rebalance Operation Id = 540fae399ea7a85faf9aa838cac20c03 |
Observing this in ns_server.debug.log
[error_logger:error,2023-11-20T00:30:17.707-08:00,ns_1@172.23.96.168:dcp_replication_manager-default<0.10274.0>:ale_error_logger_handler:do_log:101]=========================ERROR REPORT=========================** Generic server 'dcp_replication_manager-default' terminating ** Last message in was get_actual_replications** When Server state == "default"** Reason for termination ==** {{{shutdown,nuke},{gen_server,call,[<0.10387.0>,get_partitions,infinity]}}, [{gen_server,call,3,[{file,"gen_server.erl"},{line,385}]}, {dcp_replicator,get_partitions,1, [{file,"src/dcp_replicator.erl"},{line,167}]}, {dcp_replication_manager,'-handle_call/3-lc$^0/1-1-',1, [{file,"src/dcp_replication_manager.erl"}, {line,86}]}, {dcp_replication_manager,handle_call,3, [{file,"src/dcp_replication_manager.erl"}, {line,87}]}, {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,1149}]}, {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1178}]}, {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}** Client ns_heart_slow_status_updater stacktrace** [{gen,do_call,4,[{file,"gen.erl"},{line,237}]}, {gen_server,call,3,[{file,"gen_server.erl"},{line,381}]}, {dcp_replication_manager,get_actual_replications,1, [{file,"src/dcp_replication_manager.erl"}, {line,40}]}, {janitor_agent,this_node_replicator_triples,1, [{file,"src/janitor_agent.erl"},{line,374}]}, {failover_safeness_level,'-build_local_safeness_info/1-lc$^1/1-1-',1, [{file,"src/failover_safeness_level.erl"}, {line,60}]}, {failover_safeness_level,build_local_safeness_info,1, [{file,"src/failover_safeness_level.erl"}, {line,62}]}, {ns_heart,current_status_slow_inner,0, [{file,"src/ns_heart.erl"},{line,301}]}, {ns_heart,current_status_slow,1,[{file,"src/ns_heart.erl"},{line,250}]}] |
[error_logger:error,2023-11-20T00:30:17.707-08:00,ns_1@172.23.96.168:dcp_replication_manager-default<0.10274.0>:ale_error_logger_handler:do_log:101]=========================CRASH REPORT========================= crasher: initial call: dcp_replication_manager:init/1 pid: <0.10274.0> registered_name: 'dcp_replication_manager-default' exception exit: {{shutdown,nuke}, {gen_server,call,[<0.10387.0>,get_partitions,infinity]}} in function gen_server:call/3 (gen_server.erl, line 385) in call from dcp_replicator:get_partitions/1 (src/dcp_replicator.erl, line 167) in call from dcp_replication_manager:'-handle_call/3-lc$^0/1-1-'/1 (src/dcp_replication_manager.erl, line 86) in call from dcp_replication_manager:handle_call/3 (src/dcp_replication_manager.erl, line 87) in call from gen_server:try_handle_call/4 (gen_server.erl, line 1149) in call from gen_server:handle_msg/6 (gen_server.erl, line 1178) ancestors: ['single_bucket_kv_sup-default',ns_bucket_sup, ns_bucket_worker_sup,ns_server_sup,ns_server_nodes_sup, <0.8963.0>,ns_server_cluster_sup,root_sup,<0.155.0>] message_queue_len: 0 messages: [] links: [<0.10260.0>] dictionary: [] trap_exit: false status: running heap_size: 6772 stack_size: 28 reductions: 84658 neighbours: |
|
Also observing this in memcached logs
2023-11-20T00:30:17.161297-08:00 INFO 583: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.71->ns_1@172.23.96.196:default - (vb:430) Backfill complete. 0 items consisting of 0 bytes read from disk, 449 items from memory, lastReadSeqno:890, lastSentSeqnoAdvance:0, lastSentSnapStartSeqno:0, lastSentSnapEndSeqno:890, pendingBackfill:False, numBackfillPauses:0. Total runtime 6811 us (0 item/s, 0 MB/s)2023-11-20T00:30:17.161479-08:00 INFO 583: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.71->ns_1@172.23.96.196:default - (vb:430) ActiveStream::transitionState: Transitioning from backfilling to in-memory2023-11-20T00:30:17.173304-08:00 WARNING 62: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.96.168->ns_1@172.23.121.71:default - vb:566 Disconnecting the connection as there is no memory to complete replication2023-11-20T00:30:17.173316-08:00 WARNING 62: [ {"ip":"127.0.0.1","port":37878} - {"ip":"127.0.0.1","port":11209} (System, @ns_server) ] dcp.snapshot_marker returned cb::engine_errc::disconnect2023-11-20T00:30:17.173335-08:00 INFO 62: (No Engine) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.96.168->ns_1@172.23.121.71:default - Removing connection [ {"ip":"127.0.0.1","port":37878} - {"ip":"127.0.0.1","port":11209} (System, @ns_server) ] |
TAF Script to reproduce
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i node.ini upgrade_version=7.6.0-1822 -t rebalance_new.swaprebalancetests.SwapRebalanceBasicTests.do_test,nodes_init=2,replicas=1,standard_buckets=1,num-swap=1,sdk_retries=10,num_items=250000,durability=MAJORITY,bucket_type=ephemeral,sdk_timeout=50,get-cbcollect-info=True,infra_log_level=info,log_level=info,upgrade_version=7.6.0-1813,sirius_url=http://172.23.120.103:4000' |
TAF ref job : debian-durability_rebalance_swap_ephemeral_majority_6.5_P0
TAF job link : http://cb-logs-qe.s3-website-us-west-2.amazonaws.com/7.6.0-1813/jenkins_logs/test_suite_executor-TAF/287346/
Attachments
Issue Links
- is duplicated by
-
MB-60535 Rebalance exited with reason mover_crashed, unexpected_exit, 'EXIT',<0.31278.134>, wait_seqno_persisted_failed,"default",996
- Resolved