Couchbase Server
MB-60320

[Upgrade] : Rebalance exited with reason {prepare_rebalance_failed,{error,{failed_nodes,[{'ns_1@172.23.121.176',{error,timeout}}]}}}


Details

    • Type: Bug
    • Resolution: User Error
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: 7.6.0
    • Component/s: ns_server
    • Environment:
      Operating System: Debian GNU/Linux 10 (buster)
      Initial Version: Couchbase Enterprise Edition build 7.1.5-3878
      Upgraded Version: Couchbase Enterprise Edition build 7.6.0-1980

    Description

      Steps to reproduce

      1. Created a 4-node cluster with the following services:
        1. 172.23.108.224 - kv
        2. 172.23.108.248 - kv, index, n1ql
        3. 172.23.121.158 - kv, index, fts
        4. 172.23.121.176 - kv, index
      2. Created a couchstore bucket named default and loaded some items into it
      3. Ran a few n1ql and search queries
      4. Gracefully failed over node 172.23.121.158
      5. Upgraded it to Couchbase Enterprise Edition build 7.6.0-1980
      6. Added it back using full recovery and started a rebalance - rebalance succeeded
      7. Gracefully failed over node 172.23.121.176
      8. Upgraded it to Couchbase Enterprise Edition build 7.6.0-1980
      9. Added it back using full recovery
      10. Started a rebalance - rebalance failed
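For reference, the failover / recovery / rebalance portion of the steps above maps onto the cluster manager's REST API roughly as sketched below. This is a sketch only: the credentials are placeholders, the orchestrator address is taken from the ticket, and the script `echo`s each curl command rather than executing it against a live cluster.

```shell
#!/bin/sh
# Sketch of steps 7-10 as REST calls against the orchestrator node.
# Administrator:password is a placeholder credential; 8091 is the
# default cluster-manager port. `echo` prints each call instead of
# running it.
CB=http://172.23.108.224:8091
AUTH=Administrator:password

# Step 7: gracefully fail over the node
echo curl -u "$AUTH" -X POST "$CB/controller/startGracefulFailover" \
    -d otpNode=ns_1@172.23.121.176

# Step 9: add it back with full recovery
echo curl -u "$AUTH" -X POST "$CB/controller/setRecoveryType" \
    -d otpNode=ns_1@172.23.121.176 -d recoveryType=full

# Step 10: start the rebalance (knownNodes must list every node)
echo curl -u "$AUTH" -X POST "$CB/controller/rebalance" \
    -d knownNodes=ns_1@172.23.108.224,ns_1@172.23.108.248,ns_1@172.23.121.158,ns_1@172.23.121.176 \
    -d ejectedNodes=
```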

      The rebalance fails with:

      2024-01-09T00:11:07.860-08:00, ns_orchestrator:0:critical:message(ns_1@172.23.121.158) - Rebalance exited with reason
          {prepare_rebalance_failed,
              {error,
                  {failed_nodes,
                      [{'ns_1@172.23.121.176',{error,timeout}}]}}}.
      Rebalance Operation Id = 9195d476d8c4f961b66ed05d9d3192b9

      No diag logs could be collected on node 172.23.121.176; the collection attempt fails to connect:

      curl: (7) Failed to connect to 127.0.0.1 port 80 after 0 ms: Couldn't connect to server 
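One plausible reading of the failure above is that the diag request was issued without an explicit port, so curl defaulted to port 80, which nothing serves on that node. The cluster manager's REST interface, which serves the diagnostics dump, listens on 8091 by default. A hedged sketch of what the fetch would normally look like (credentials are placeholders; the command is printed, not executed):

```shell
#!/bin/sh
# Without an explicit port, curl defaults to 80:
#   curl http://127.0.0.1/diag   -> "Failed to connect to 127.0.0.1 port 80"
# The REST API listens on 8091 by default, so the fetch would target:
url="http://127.0.0.1:8091/diag"
echo curl -s -u Administrator:password "$url"   # prints the intended command
```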

      Observing ERROR REPORT and CRASH REPORT entries in the babysitter logs on node 172.23.121.176:

      [ns_server:info,2024-01-09T00:01:03.729-08:00,babysitter_of_ns_1@cb.local:<0.133.0>:ns_port_server:log:226]ns_server<0.133.0>: 10278: Booted. Waiting for shutdown request
      [ns_server:debug,2024-01-09T00:09:34.325-08:00,babysitter_of_ns_1@cb.local:goxdcr-goport<0.152.0>:goport:handle_eof:592]Stream 'stdout' closed

      [ns_server:debug,2024-01-09T00:09:34.325-08:00,babysitter_of_ns_1@cb.local:goxdcr-goport<0.152.0>:goport:handle_eof:592]Stream 'stderr' closed

      [ns_server:info,2024-01-09T00:09:34.327-08:00,babysitter_of_ns_1@cb.local:goxdcr-goport<0.152.0>:goport:handle_process_exit:573]Port exited with status 1.

      [error_logger:error,2024-01-09T00:09:34.330-08:00,babysitter_of_ns_1@cb.local:<0.146.0>:ale_error_logger_handler:do_log:101]
      =========================ERROR REPORT=========================
      ** Generic server <0.146.0> terminating
      ** Last message in was {<0.152.0>,{exit_status,1}}
      ** When Server state == {state,<0.152.0>,10459,
             {goxdcr,"/opt/couchbase/bin/goxdcr",
              ["-sourceKVAdminPort=8091","-xdcrRestPort=9998",
               "-isEnterprise=true","-ipv4=required","-ipv6=optional",
               "-caFile=/opt/couchbase/var/lib/couchbase/config/certs/ca.pem",
               "-clientCertFile=/opt/couchbase/var/lib/couchbase/config/certs/client_chain.pem",
               "-clientKeyFile=/opt/couchbase/var/lib/couchbase/config/certs/client_pkey.pem"],
              [via_goport,exit_status,stderr_to_stdout,
               {env,[{"GOTRACEBACK","single"},
                     {"GOMAXPROCS","4"},
                     {"CBAUTH_REVRPC_URL",
                      {dynamic,{ns_ports_setup,remote_build_cbauth_revrpc_url,[goxdcr]}}}]}]},
             {ringbuffer,1488,1024,
              {[{<<"2024-01-09T00:09:34.321-08:00 WARN GOXDCR.Utils: GetAllMetadataFromCatalog(remoteCluster) took 15.501958857s\nMetadata service not available after 30"...>>,160},
                {<<"2024-01-09T00:09:34.321-08:00 WARN GOXDCR.MetadataSvc: metakv.ListAllChildren failed. path=/remoteCluster/, err=Get \"http://127.0.0.1:8091/_meta"...>>,354},
                {<<"2024-01-09T00:09:26.321-08:00 WARN GOXDCR.MetadataSvc: metakv.ListAllChildren failed. path=/remoteCluster/, err=Get \"http://127.0.0.1:8091/_"...>>,487}],
               [{<<"2024-01-09T00:09:22.320-08:00 WARN GOXDCR.MetadataSvc: metakv.ListAllChildren failed. path=/remoteCluster/, err=Get \"http://127.0.0.1:8091/_meta"...>>,487}]}},
             goxdcr,undefined,[],0}
      ** Reason for termination ==
      ** {abnormal,1}

      [error_logger:error,2024-01-09T00:09:34.332-08:00,babysitter_of_ns_1@cb.local:<0.146.0>:ale_error_logger_handler:do_log:101]
      =========================CRASH REPORT=========================
        crasher:
          initial call: ns_port_server:init/1
          pid: <0.146.0>
          registered_name: []
          exception exit: {abnormal,1}
            in function  gen_server:handle_common_reply/8 (gen_server.erl, line 1241)
          ancestors: [<0.145.0>,<0.144.0>,ns_child_ports_sup,ns_babysitter_sup,<0.115.0>]
          message_queue_len: 1
          messages: [{'EXIT',<0.152.0>,normal}]
          links: [<0.145.0>]
          dictionary: []
          trap_exit: true
          status: running
          heap_size: 10958
          stack_size: 28
          reductions: 60089
        neighbours:

      [ns_server:info,2024-01-09T00:09:34.332-08:00,babysitter_of_ns_1@cb.local:<0.145.0>:supervisor_cushion:handle_info:66]Cushion managed supervisor for goxdcr failed:  {abnormal,1}

      [error_logger:error,2024-01-09T00:09:34.333-08:00,babysitter_of_ns_1@cb.local:<0.145.0>:ale_error_logger_handler:do_log:101]
      =========================ERROR REPORT=========================
      ** Generic server <0.145.0> terminating
      ** Last message in was {die,{abnormal,1}}
      ** When Server state == {state,goxdcr,5000,-576460740445204035,undefined,infinity,#{always_delay => false}}
      ** Reason for termination ==
      ** {abnormal,1}

      [error_logger:error,2024-01-09T00:09:34.333-08:00,babysitter_of_ns_1@cb.local:<0.145.0>:ale_error_logger_handler:do_log:101]
      =========================CRASH REPORT=========================
        crasher:
          initial call: supervisor_cushion:init/1
          pid: <0.145.0>
          registered_name: []
          exception exit: {abnormal,1}
            in function  gen_server:handle_common_reply/8 (gen_server.erl, line 1241)
          ancestors: [<0.144.0>,ns_child_ports_sup,ns_babysitter_sup,<0.115.0>]
          message_queue_len: 0
          messages: []
          links: [<0.144.0>]
          dictionary: []
          trap_exit: true
          status: running
          heap_size: 6772
          stack_size: 28
          reductions: 11729
        neighbours:

      [error_logger:error,2024-01-09T00:09:34.333-08:00,babysitter_of_ns_1@cb.local:<0.144.0>:ale_error_logger_handler:do_log:101]
      =========================CRASH REPORT=========================
        crasher:
          initial call: erlang:apply/2
          pid: <0.144.0>
          registered_name: []
          exception exit: {abnormal,1}
            in function  restartable:loop/4 (src/restartable.erl, line 63)
          ancestors: [ns_child_ports_sup,ns_babysitter_sup,<0.115.0>]
          message_queue_len: 0
          messages: []
          links: [<0.134.0>]
          dictionary: []
          trap_exit: true
          status: running
          heap_size: 2586
          stack_size: 28
          reductions: 3083
        neighbours:

      [error_logger:error,2024-01-09T00:09:34.333-08:00,babysitter_of_ns_1@cb.local:ns_child_ports_sup<0.134.0>:ale_error_logger_handler:do_log:101]
      =========================SUPERVISOR REPORT=========================
          supervisor: {local,ns_child_ports_sup}
          errorContext: child_terminated
          reason: {abnormal,1}
          offender: [{pid,<0.144.0>},
                     {id,{goxdcr,"/opt/couchbase/bin/goxdcr",
                          ["-sourceKVAdminPort=8091","-xdcrRestPort=9998",
                           "-isEnterprise=true","-ipv4=required","-ipv6=optional",
                           "-caFile=/opt/couchbase/var/lib/couchbase/config/certs/ca.pem",
                           "-clientCertFile=/opt/couchbase/var/lib/couchbase/config/certs/client_chain.pem",
                           "-clientKeyFile=/opt/couchbase/var/lib/couchbase/config/certs/client_pkey.pem"],
                          [via_goport,exit_status,stderr_to_stdout,
                           {env,[{"GOTRACEBACK","single"},
                                 {"GOMAXPROCS","4"},
                                 {"CBAUTH_REVRPC_URL",
                                  {dynamic,{ns_ports_setup,remote_build_cbauth_revrpc_url,[goxdcr]}}}]},
                           {log,"goxdcr.log"}]}},
                     {mfargs,{restartable,start_link,
                              [{supervisor_cushion,start_link,
                                [goxdcr,5000,infinity,ns_port_server,start_link,
                                 [#Fun<ns_child_ports_sup.2.17249174>]]},
                               86400000]}},
                     {restart_type,permanent},
                     {significant,false},
                     {shutdown,infinity},
                     {child_type,worker}]

      [ns_server:debug,2024-01-09T00:09:34.333-08:00,babysitter_of_ns_1@cb.local:<0.397.0>:supervisor_cushion:init:39]Starting supervisor cushion for goxdcr with delay of 5000

      [error_logger:info,2024-01-09T00:09:34.336-08:00,babysitter_of_ns_1@cb.local:ale_dynamic_sup<0.96.0>:ale_error_logger_handler:do_log:101]
      =========================PROGRESS REPORT=========================
          supervisor: {local,ale_dynamic_sup}
          started: [{pid,<0.399.0>},
                    {id,'sink-goxdcr'},
                    {mfargs,{ale_dynamic_sup,delay_death,
                             [{ale_disk_sink,start_link,
                               ['sink-goxdcr',
                                "/opt/couchbase/var/lib/couchbase/logs/goxdcr.log",
                                [{rotation,[{compress,true},
                                            {size,41943040},
                                            {num_files,10},
                                            {buffer_size_max,52428800}]}]]},
                              1000]}},
                    {restart_type,permanent},
                    {significant,false},
                    {shutdown,5000},
                    {child_type,worker}]

      [ns_server:debug,2024-01-09T00:09:34.585-08:00,babysitter_of_ns_1@cb.local:<0.396.0>:restartable:start_child:92]Started child process <0.397.0>
        MFA: {supervisor_cushion,start_link,
              [goxdcr,5000,infinity,ns_port_server,start_link,
               [#Fun<ns_child_ports_sup.2.17249174>]]}

      [error_logger:info,2024-01-09T00:09:34.585-08:00,babysitter_of_ns_1@cb.local:ns_child_ports_sup<0.134.0>:ale_error_logger_handler:do_log:101]
      =========================PROGRESS REPORT=========================
          supervisor: {local,ns_child_ports_sup}
          started: [{pid,<0.396.0>},
                    {id,{goxdcr,"/opt/couchbase/bin/goxdcr",
                         ["-sourceKVAdminPort=8091","-xdcrRestPort=9998",
                          "-isEnterprise=true","-ipv4=required","-ipv6=optional",
                          "-caFile=/opt/couchbase/var/lib/couchbase/config/certs/ca.pem",
                          "-clientCertFile=/opt/couchbase/var/lib/couchbase/config/certs/client_chain.pem",
                          "-clientKeyFile=/opt/couchbase/var/lib/couchbase/config/certs/client_pkey.pem"],
                         [via_goport,exit_status,stderr_to_stdout,
                          {env,[{"GOTRACEBACK","single"},
                                {"GOMAXPROCS","4"},
                                {"CBAUTH_REVRPC_URL",
                                 {dynamic,{ns_ports_setup,remote_build_cbauth_revrpc_url,[goxdcr]}}}]},
                          {log,"goxdcr.log"}]}},
                    {mfargs,{restartable,start_link,
                             [{supervisor_cushion,start_link,
                               [goxdcr,5000,infinity,ns_port_server,start_link,
                                [#Fun<ns_child_ports_sup.2.17249174>]]},
                              86400000]}},
                    {restart_type,permanent},
                    {significant,false},
                    {shutdown,infinity},
                    {child_type,worker}]

      Observing many "Accepted new connection" and "Connection down" entries in ns_server.debug.log on node 172.23.121.176:

      [ns_server:debug,2024-01-09T00:11:08.678-08:00,ns_1@cb.local:cb_dist<0.227.0>:cb_dist:info_msg:1098]cb_dist: Accepted new connection from <0.231.0> DistCtrl #Port<0.4201>: {con,#Ref<0.2369029036.2466250753.8093>,inet_tcp_dist,undefined,undefined}

      [ns_server:debug,2024-01-09T00:11:08.678-08:00,ns_1@cb.local:net_kernel<0.229.0>:cb_dist:info_msg:1098]cb_dist: Accepting connection from <0.231.0> using module inet_tcp_dist

      [ns_server:debug,2024-01-09T00:11:08.678-08:00,ns_1@cb.local:cb_dist<0.227.0>:cb_dist:info_msg:1098]cb_dist: Updated connection: {con,#Ref<0.2369029036.2466250753.8093>,inet_tcp_dist,<0.32550.0>,#Ref<0.2369029036.2466250753.8096>}

      [ns_server:debug,2024-01-09T00:11:08.678-08:00,ns_1@cb.local:cb_dist<0.227.0>:cb_dist:info_msg:1098]cb_dist: Connection down: {con,#Ref<0.2369029036.2466250753.8093>,inet_tcp_dist,<0.32550.0>,#Ref<0.2369029036.2466250753.8096>}

      [error_logger:info,2024-01-09T00:11:08.678-08:00,ns_1@cb.local:net_kernel<0.229.0>:ale_error_logger_handler:do_log:101]
      =========================NOTICE REPORT=========================
      {net_kernel,{'EXIT',<0.32550.0>,{recv_challenge_reply_failed,{error,closed}}}}

      [error_logger:info,2024-01-09T00:11:08.679-08:00,ns_1@cb.local:net_kernel<0.229.0>:ale_error_logger_handler:do_log:101]

      Observing "failed with error not_leader" in ns_server.debug.log on node 172.23.121.176 

      [ns_server:error,2024-01-09T00:11:08.648-08:00,ns_1@cb.local:<0.31710.0>:leader_activities:report_error:944]Activity {default,{ns_janitor,["default"],cleanup}} failed with error not_leader

      [error_logger:error,2024-01-09T00:11:08.649-08:00,ns_1@cb.local:cleanup_process<0.31708.0>:ale_error_logger_handler:do_log:101]
      =========================CRASH REPORT=========================
        crasher:
          initial call: ns_janitor_server:run_cleanup/2
          pid: <0.31708.0>
          registered_name: cleanup_process
          exception error: no match of right hand side value
                           {leader_activities_error,
                            {default,{ns_janitor,["default"],cleanup}},
                            not_leader}
            in function  ns_janitor:run_buckets_cleanup_activity/3 (src/ns_janitor.erl, line 85)
            in call from ns_janitor:cleanup_buckets/2 (src/ns_janitor.erl, line 78)
            in call from ns_janitor:cleanup/2 (src/ns_janitor.erl, line 41)
            in call from ns_janitor_server:'-run_cleanup/2-fun-0-'/2 (src/ns_janitor_server.erl, line 180)
            in call from lists:foldl_1/3 (lists.erl, line 1355)
            in call from ns_janitor_server:run_cleanup/2 (src/ns_janitor_server.erl, line 178)
          ancestors: [ns_janitor_server,ns_orchestrator_child_sup,
                      ns_orchestrator_sup,mb_master_sup,mb_master,
                      leader_registry_sup,leader_services_sup,<0.778.0>,
                      ns_server_sup,ns_server_nodes_sup,<0.298.0>,
                      ns_server_cluster_sup,root_sup,<0.155.0>]
          message_queue_len: 0
          messages: []
          links: [<0.1378.0>]
          dictionary: []
          trap_exit: false
          status: running
          heap_size: 10958
          stack_size: 28
          reductions: 679
        neighbours:

       


      Testrunner script to reproduce


      ./testrunner -i /data/workspace/debian-p0-upgrade-vset00-00-new-with-services-from-715_2/testexec.47662.ini -p upgrade_version=7.6.0-1980,stop-on-failure=False,skip_cleanup=False,skip_init_check_cbserver=true,groups=simple,gsi_type=memory_optimized,upgrade_test=True,dgm_run=True,skip_log_scan=true,use_single_bucket=True,get-cbcollect-info=True,bucket_storage=couchstore,sirius_url=http://172.23.120.103:4000 -t newupgradetests.MultiNodesUpgradeTests.test_offline_upgrade_with_add_new_services,initial_version=7.1.5-3878,nodes_init=4,ddocs_num=3,pre_upgrade=create_replica_index,doc-per-day=1,dataset=default,after_upgrade_buckets_flush=True,groups=simple,initial-services-setting=kv-kv,n1ql,index-kv,index,after_upgrade_services_in=index,num_index_replica=2,disable_HTP=True,offline_failover_upgrade=True,upgrade_test=True,dgm_run=True,init_nodes=False

      Job name : debian-upgrade-new-with-services-from-715_2

      Job ref : http://cb-logs-qe.s3-website-us-west-2.amazonaws.com/7.6.0-1980/jenkins_logs/test_suite_executor/661520/

            People

              Assignee: Abhijeeth Nuthan
              Reporter: Raghav S K
