Couchbase Server: MB-59721

[Rebalance] Rebalance failed with reason {mover_crashed,{{{{{badmatch,[{<0.29601.0>,{done,exit,{socket_closed,{gen_server,call,[<0.19270.0>,{setup_streams,


    Description

      Steps to reproduce

1. Created a 2-node KV cluster: 172.23.96.168 and 172.23.121.71
2. Created an ephemeral bucket named default
3. Loaded 213805 items into the bucket
4. Added node 172.23.96.196
5. Started a swap rebalance by ejecting node 172.23.96.168
6. In parallel, started a create SyncWrite workload with durability MAJORITY, which continued through the rebalance (a minimal sketch of such a workload is shown below)
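
For illustration, here is a minimal sketch of such a durable-write workload using the Couchbase Python SDK (4.x). The node address, credentials, and key naming are assumptions; the actual test drives the load through the TAF loader rather than this script.

# Hedged sketch of a create workload with durability MAJORITY.
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions, UpsertOptions
from couchbase.durability import ServerDurability, Durability
from couchbase.exceptions import CouchbaseException

# Assumed connection details; any cluster node works.
cluster = Cluster(
    "couchbase://172.23.121.71",
    ClusterOptions(PasswordAuthenticator("Administrator", "password")),
)
collection = cluster.bucket("default").default_collection()

# Keep issuing durable creates while the swap rebalance runs. MAJORITY
# requires each mutation to reach a majority of replicas before the
# server acknowledges the write.
for i in range(250000):
    try:
        collection.upsert(
            f"doc-{i}",  # hypothetical key scheme
            {"val": i},
            UpsertOptions(durability=ServerDurability(Durability.MAJORITY)),
        )
    except CouchbaseException as exc:
        # Durable writes can fail transiently during rebalance; log and continue.
        print(f"doc-{i}: {exc}")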

The rebalance fails with the following error:

2023-11-20T00:30:17.528-08:00, ns_orchestrator:0:critical:message(ns_1@172.23.121.71) - Rebalance exited with reason
{mover_crashed,
 {{{{{badmatch,
      [{<0.29601.0>,
        {done,exit,
         {socket_closed,
          {gen_server,call,
           [<0.19270.0>,
            {setup_streams,
             [512,513,514,515,516,517,518,519,
              520,521,522,523,524,525,526,527,
              528,529,530,531,532,533,534,535,
              536,537,538,539,540,541,542,543,
              544,545,546,547,548,549,550,551,
              552,553,554,555,556,557,558,559,
              560,561,562,563,564,565,566,567,
              568,569,570,571,572,573,574,575,
              576,577,578,579,580,581,582,583,
              584,585,586,587,588,589,590,591,
              592,593,594,595,596,597,598,599,
              600,601,602,603,604,605,606,607,
              608,609,610,611,612,613,614,615,
              616,617,618,619,620,621,622,623,
              624,625,626,627,628,629,630,631,
              632,633,634,635,636,637,638,639,
              640,641,642,643,644,645,646,647,
              648,649,650,651,652,653,654,655,
              656,657,658,659,660,661,662,663,
              664,665,666,667,668,669,670,671,
              672,673,674,675,676,677,678,679,
              680,681,682,683,684,685,686,687,
              688,689,690,691,692,693,694,695,
              696,697,698,699,700,701,702,703,
              704,705,706,707,708,709,710,711,
              712,713,714,715,716,717,718,719,
              720,721,722,723,724,725,726,727,
              728,729,730,731,732,733,734,735,
              736,737,738,739,740,741,742,743,
              744,745,746,747,748,749,750,751,
              752,753,754,755,756,757,758,759,
              760,761,762,763,764,765,766,767,
              768,769,770,771,772,773,774,775,
              776,777,778,779,780,781,782,783,
              784,785,786,787,788,789,790,791,
              792,793,794,795,796,797,798,799,
              800,801,802,803,804,805,806,807,
              808,809,810,811,812,813,814,815,
              816,817,818,819,820,821,822,823,
              824,825,826,827,828,829,830,831,
              832,833,834,835,836,837,838,839,
              840,841,842,843,844,845,846,847,
              848,849,850,851,852,853,854,855,
              856,857,858,859,860,861,862,863,
              864,865,866,867,868,869,870,871,
              872,873,874,875,876,877,878,879,
              880,881,882,883,884,885,886,887,
              888,889,890,891,892,893,894,895,
              896,897,898,899,900,901,902,903,
              904,905,906,907,908,909,910,911,
              912,913,914,915,916,917,918,919,
              920,921,922,923,924,925,926,927,
              928,929,930,931,932,933,934,935,
              936,937,938,939,940,941,942,943,
              944,945,946,947,948,949,950,951,
              952,953,954,955,956,957,958,959,
              960,961,962,963,964,965,966,967,
              968,969,970,971,972,973,974,975,
              976,977,978,979,980,981,982,983,
              984,985,986,987,988,989,990,991,
              992,993,994,995,996,997,998,999,
              1000,1001,1002,1003,1004,1005,1006,
              1007,1008,1009,1010,1011,1012,1013,
              1014,1015,1016,1017,1018,1019,1020,
              1022]},
            infinity]}},
         [{gen_server,call,3,
           [{file,"gen_server.erl"},{line,385}]},
          {dcp_replicator,'-spawn_and_wait/1-fun-0-',1,
           [{file,"src/dcp_replicator.erl"},{line,313}]}]}}]},
     [{misc,sync_shutdown_many_i_am_trapping_exits,1,
       [{file,"src/misc.erl"},{line,1517}]},
      {dcp_replicator,spawn_and_wait,1,
       [{file,"src/dcp_replicator.erl"},{line,334}]},
      {dcp_replicator,handle_call,3,
       [{file,"src/dcp_replicator.erl"},{line,146}]},
      {gen_server,try_handle_call,4,
       [{file,"gen_server.erl"},{line,1149}]},
      {gen_server,handle_msg,6,
       [{file,"gen_server.erl"},{line,1178}]},
      {proc_lib,init_p_do_apply,3,
       [{file,"proc_lib.erl"},{line,240}]}]},
    {gen_server,call,[<0.19269.0>,get_partitions,infinity]}},
   {gen_server,call,
    ['dcp_replication_manager-default',
     {get_replicator_pid,1006},
     infinity]}},
  {gen_server,call,
   [{'janitor_agent-default','ns_1@172.23.121.71'},
    {if_rebalance,<0.25989.0>,{get_vbucket_high_seqno,428}},
    infinity]}}}.
Rebalance Operation Id = 540fae399ea7a85faf9aa838cac20c03

The following is observed in ns_server.debug.log:

[error_logger:error,2023-11-20T00:30:17.707-08:00,ns_1@172.23.96.168:dcp_replication_manager-default<0.10274.0>:ale_error_logger_handler:do_log:101]
=========================ERROR REPORT=========================
** Generic server 'dcp_replication_manager-default' terminating
** Last message in was get_actual_replications
** When Server state == "default"
** Reason for termination ==
** {{{shutdown,nuke},{gen_server,call,[<0.10387.0>,get_partitions,infinity]}},
    [{gen_server,call,3,[{file,"gen_server.erl"},{line,385}]},
     {dcp_replicator,get_partitions,1,
                     [{file,"src/dcp_replicator.erl"},{line,167}]},
     {dcp_replication_manager,'-handle_call/3-lc$^0/1-1-',1,
                              [{file,"src/dcp_replication_manager.erl"},
                               {line,86}]},
     {dcp_replication_manager,handle_call,3,
                              [{file,"src/dcp_replication_manager.erl"},
                               {line,87}]},
     {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,1149}]},
     {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1178}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
** Client ns_heart_slow_status_updater stacktrace
** [{gen,do_call,4,[{file,"gen.erl"},{line,237}]},
    {gen_server,call,3,[{file,"gen_server.erl"},{line,381}]},
    {dcp_replication_manager,get_actual_replications,1,
                             [{file,"src/dcp_replication_manager.erl"},
                              {line,40}]},
    {janitor_agent,this_node_replicator_triples,1,
                   [{file,"src/janitor_agent.erl"},{line,374}]},
    {failover_safeness_level,'-build_local_safeness_info/1-lc$^1/1-1-',1,
                             [{file,"src/failover_safeness_level.erl"},
                              {line,60}]},
    {failover_safeness_level,build_local_safeness_info,1,
                             [{file,"src/failover_safeness_level.erl"},
                              {line,62}]},
    {ns_heart,current_status_slow_inner,0,
              [{file,"src/ns_heart.erl"},{line,301}]},
    {ns_heart,current_status_slow,1,[{file,"src/ns_heart.erl"},{line,250}]}]

[error_logger:error,2023-11-20T00:30:17.707-08:00,ns_1@172.23.96.168:dcp_replication_manager-default<0.10274.0>:ale_error_logger_handler:do_log:101]
=========================CRASH REPORT=========================
  crasher:
    initial call: dcp_replication_manager:init/1
    pid: <0.10274.0>
    registered_name: 'dcp_replication_manager-default'
    exception exit: {{shutdown,nuke},
                     {gen_server,call,[<0.10387.0>,get_partitions,infinity]}}
      in function  gen_server:call/3 (gen_server.erl, line 385)
      in call from dcp_replicator:get_partitions/1 (src/dcp_replicator.erl, line 167)
      in call from dcp_replication_manager:'-handle_call/3-lc$^0/1-1-'/1 (src/dcp_replication_manager.erl, line 86)
      in call from dcp_replication_manager:handle_call/3 (src/dcp_replication_manager.erl, line 87)
      in call from gen_server:try_handle_call/4 (gen_server.erl, line 1149)
      in call from gen_server:handle_msg/6 (gen_server.erl, line 1178)
    ancestors: ['single_bucket_kv_sup-default',ns_bucket_sup,
                ns_bucket_worker_sup,ns_server_sup,ns_server_nodes_sup,
                <0.8963.0>,ns_server_cluster_sup,root_sup,<0.155.0>]
    message_queue_len: 0
    messages: []
    links: [<0.10260.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 6772
    stack_size: 28
    reductions: 84658
  neighbours:
       

The following is also observed in the memcached logs:

2023-11-20T00:30:17.161297-08:00 INFO 583: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.71->ns_1@172.23.96.196:default - (vb:430) Backfill complete. 0 items consisting of 0 bytes read from disk, 449 items from memory, lastReadSeqno:890, lastSentSeqnoAdvance:0, lastSentSnapStartSeqno:0, lastSentSnapEndSeqno:890, pendingBackfill:False, numBackfillPauses:0. Total runtime 6811 us (0 item/s, 0 MB/s)
2023-11-20T00:30:17.161479-08:00 INFO 583: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.71->ns_1@172.23.96.196:default - (vb:430) ActiveStream::transitionState: Transitioning from backfilling to in-memory
2023-11-20T00:30:17.173304-08:00 WARNING 62: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.96.168->ns_1@172.23.121.71:default - vb:566 Disconnecting the connection as there is no memory to complete replication
2023-11-20T00:30:17.173316-08:00 WARNING 62: [ {"ip":"127.0.0.1","port":37878} - {"ip":"127.0.0.1","port":11209} (System, @ns_server) ] dcp.snapshot_marker returned cb::engine_errc::disconnect
2023-11-20T00:30:17.173335-08:00 INFO 62: (No Engine) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.96.168->ns_1@172.23.121.71:default - Removing connection [ {"ip":"127.0.0.1","port":37878} - {"ip":"127.0.0.1","port":11209} (System, @ns_server) ]
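
The consumer-side warning ("Disconnecting the connection as there is no memory to complete replication") indicates the ephemeral bucket on the destination node hit its memory limit while consuming the DCP stream, which is consistent with the socket_closed seen in the rebalance failure above. As a quick check during a run, the bucket's RAM quota versus usage can be polled over the REST API; this is a hedged sketch, with the host and credentials assumed.

# Hedged sketch: check the bucket's RAM quota vs. current usage over REST.
# Host and credentials are assumptions; field names follow the
# /pools/default/buckets/<bucket> response.
import requests

NODE = "http://172.23.121.71:8091"    # any cluster node
AUTH = ("Administrator", "password")  # assumed credentials

resp = requests.get(f"{NODE}/pools/default/buckets/default", auth=AUTH)
resp.raise_for_status()
bucket = resp.json()

quota = bucket["quota"]["ram"]              # bucket RAM quota, bytes
mem_used = bucket["basicStats"]["memUsed"]  # memory currently in use, bytes
print(f"memUsed {mem_used} / quota {quota} ({100.0 * mem_used / quota:.1f}%)")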


       

      TAF Script to reproduce

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i node.ini upgrade_version=7.6.0-1822 -t rebalance_new.swaprebalancetests.SwapRebalanceBasicTests.do_test,nodes_init=2,replicas=1,standard_buckets=1,num-swap=1,sdk_retries=10,num_items=250000,durability=MAJORITY,bucket_type=ephemeral,sdk_timeout=50,get-cbcollect-info=True,infra_log_level=info,log_level=info,upgrade_version=7.6.0-1813,sirius_url=http://172.23.120.103:4000' 

TAF ref job: debian-durability_rebalance_swap_ephemeral_majority_6.5_P0

TAF job link: http://cb-logs-qe.s3-website-us-west-2.amazonaws.com/7.6.0-1813/jenkins_logs/test_suite_executor-TAF/287346/

People

Assignee: Raghav S K (raghav.sk)
Reporter: Raghav S K (raghav.sk)
