  Couchbase Server / MB-30789

Rebalance fails with reason "Got unexpected exit signal" by ns_server


Details

    • Type: Bug
    • Status: Closed
    • Priority: Test Blocker
    • Resolution: Won't Fix
    • Affects Version/s: 5.5.0
    • Fix Version/s: None
    • Component/s: ns_server
    • Triage: Untriaged
    • Is this a Regression?: Yes

    Description

      This ticket is related to K8S-512. Tested using the couchbase-server Docker image:

      enterprise-5.5.0: Pulling from couchbase/server
      Digest: sha256:5228ded10c8fca39e8cea48cd845130d97b3770dc50f336b4214dcb165faaeda
      

      Scenario:

      1. Created a 1-node couchbase-cluster and added the "default" bucket.
      2. Scaled the couchbase-cluster up to 5 nodes.
      3. Rebalance started on the 5-node cluster and failed.
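
      A minimal sketch (not part of the original report, and assuming the standard 8091 admin REST port with placeholder credentials) of driving the same rebalance through the cluster-manager REST API:

      # Sketch only: trigger the rebalance across the 5 pods and poll its progress.
      import time
      import requests

      BASE = "http://test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:8091"
      AUTH = ("Administrator", "password")  # placeholder credentials

      # otpNode names of the five pods, as they appear in the logs below
      known_nodes = ",".join(
          "ns_1@test-couchbase-k2zd9-%04d.test-couchbase-k2zd9.ashwin.svc" % i
          for i in range(5))

      # Start the rebalance (same action as the UI "Rebalance" button)
      requests.post(BASE + "/controller/rebalance", auth=AUTH,
                    data={"knownNodes": known_nodes, "ejectedNodes": ""}).raise_for_status()

      # Poll until it finishes; a failure like the one below is reported here as well
      while True:
          progress = requests.get(BASE + "/pools/default/rebalanceProgress", auth=AUTH).json()
          print(progress.get("status"), progress)
          if progress.get("status") != "running":
              break
          time.sleep(5)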

      Please refer to the log files from the ticket K8S-512

       

      cbcollect_info_ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc_20180805-112200/ns_server.info.log:10441:[ns_server:error,2018-08-05T11:19:39.949Z,ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:<0.19219.0>:ns_single_vbucket_mover:spawn_and_wait:105]Got unexpected exit signal {'EXIT',<0.19239.0>,
      cbcollect_info_ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc_20180805-112200/ns_server.error.log:854:[ns_server:error,2018-08-05T11:19:39.949Z,ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:<0.19219.0>:ns_single_vbucket_mover:spawn_and_wait:105]Got unexpected exit signal {'EXIT',<0.19239.0>,
      cbcollect_info_ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc_20180805-112200/ns_server.debug.log:35469:[ns_server:error,2018-08-05T11:19:39.949Z,ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:<0.19219.0>:ns_single_vbucket_mover:spawn_and_wait:105]Got unexpected exit signal {'EXIT',<0.19239.0>,

      ns_server error (full Erlang term from the log):

       

      [ns_server:error,2018-08-05T11:19:39.949Z,ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:<0.19219.0>:ns_single_vbucket_mover:spawn_and_wait:105]Got unexpected exit signal {'EXIT',<0.19239.0>,
                                  {bulk_set_vbucket_state_failed,
                                   [{'ns_1@test-couchbase-k2zd9-0002.test-couchbase-k2zd9.ashwin.svc',
                                     {'EXIT',
                                      {{{{{case_clause,
                                           {error,
                                            {{{badmatch,
                                               {error,
                                                {{badmatch,{error,nxdomain}},
                                                 [{dcp_proxy,connect,5,
                                                   [{file,"src/dcp_proxy.erl"},
                                                    {line,228}]},
                                                  {dcp_proxy,maybe_connect,2,
                                                   [{file,"src/dcp_proxy.erl"},
                                                    {line,210}]},
                                                  {dcp_consumer_conn,init,2,
                                                   [{file,
                                                     "src/dcp_consumer_conn.erl"},
                                                    {line,57}]},
                                                  {dcp_proxy,init,1,
                                                   [{file,"src/dcp_proxy.erl"},
                                                    {line,57}]},
                                                  {gen_server,init_it,6,
                                                   [{file,"gen_server.erl"},
                                                    {line,304}]},
                                                  {proc_lib,init_p_do_apply,3,
                                                   [{file,"proc_lib.erl"},
                                                    {line,239}]}]}}},
                                              [{dcp_replicator,init,1,
                                                [{file,"src/dcp_replicator.erl"},
                                                 {line,48}]},
                                               {gen_server,init_it,6,
                                                [{file,"gen_server.erl"},
                                                 {line,304}]},
                                               {proc_lib,init_p_do_apply,3,
                                                [{file,"proc_lib.erl"},
                                                 {line,239}]}]},
                                             {child,undefined,
                                              {'ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc',
                                               [del_times,snappy,xattr]},
                                              {dcp_replicator,start_link,
                                               ['ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc',
                                                "default",
                                                [del_times,snappy,xattr]]},
                                              temporary,60000,worker,
                                              [dcp_replicator]}}}},
                                          [{dcp_sup,start_replicator,2,
                                            [{file,"src/dcp_sup.erl"},{line,57}]},
                                           {dcp_sup,
                                            '-manage_replicators/2-lc$^3/1-3-',2,
                                            [{file,"src/dcp_sup.erl"},{line,94}]},
                                           {dcp_replication_manager,handle_call,3,
                                            [{file,
                                              "src/dcp_replication_manager.erl"},
                                             {line,89}]},
                                           {gen_server,handle_msg,5,
                                            [{file,"gen_server.erl"},{line,585}]},
                                           {proc_lib,init_p_do_apply,3,
                                            [{file,"proc_lib.erl"},{line,239}]}]},
                                         {gen_server,call,
                                          ['dcp_replication_manager-default',
                                           {manage_replicators,
                                            ['ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc']},
                                           infinity]}},
                                        {gen_server,call,
                                         ['replication_manager-default',
                                          {change_vbucket_replication,592,
                                           'ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc'},
                                          infinity]}},
                                       {gen_server,call,
                                        [{'janitor_agent-default',
                                          'ns_1@test-couchbase-k2zd9-0002.test-couchbase-k2zd9.ashwin.svc'},
                                         {if_rebalance,<0.12269.0>,
                                          {update_vbucket_state,592,pending,
                                           passive,
                                           'ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc'}},
                                         infinity]}}}}]}}

       

      Attachments

        Issue Links


          Activity

             dfinlay Dave Finlay added a comment - edited

            The rebalance failure is caused by a DNS lookup failure on node 0002. There are also problems in general with the network connectivity.

            The rebalance starts:

            user:info,2018-08-05T11:18:50.041Z,ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:<0.764.0>:ns_orchestrator:idle:663]Starting rebalance, KeepNodes = ['ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashw\
            in.svc',
                                             'ns_1@test-couchbase-k2zd9-0001.test-couchbase-k2zd9.ashwin.svc',
                                             'ns_1@test-couchbase-k2zd9-0002.test-couchbase-k2zd9.ashwin.svc',
                                             'ns_1@test-couchbase-k2zd9-0003.test-couchbase-k2zd9.ashwin.svc',
                                             'ns_1@test-couchbase-k2zd9-0004.test-couchbase-k2zd9.ashwin.svc'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes
            

             Prior to this, nodes 0001, 0002, 0003 and 0004 were added to the cluster; this is the first rebalance to build out the full cluster. However, it's immediately clear that there are problems: nodes 0001, 0002, 0003 and 0004 are reported as down by the auto-failover logic, and the countdown to failover begins:

            ns_server:debug,2018-08-05T11:18:51.599Z,ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:<0.765.0>:auto_failover:log_down_nodes_reason:382]Node 'ns_1@test-couchbase-k2zd9-0004.test-couchbase-k2zd9.ashwin.svc'\
             is considered down. Reason:"The data service did not respond for the duration of the auto-failover threshold. Either none of the buckets have warmed up or there is an issue with the data service. "
            [ns_server:debug,2018-08-05T11:18:51.599Z,ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:<0.765.0>:auto_failover:log_down_nodes_reason:382]Node 'ns_1@test-couchbase-k2zd9-0003.test-couchbase-k2zd9.ashwin.svc'\
             is considered down. Reason:"The data service did not respond for the duration of the auto-failover threshold. Either none of the buckets have warmed up or there is an issue with the data service. "
            [ns_server:debug,2018-08-05T11:18:51.599Z,ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:<0.765.0>:auto_failover:log_down_nodes_reason:382]Node 'ns_1@test-couchbase-k2zd9-0002.test-couchbase-k2zd9.ashwin.svc'\
             is considered down. Reason:"The data service did not respond for the duration of the auto-failover threshold. Either none of the buckets have warmed up or there is an issue with the data service. "
            [ns_server:debug,2018-08-05T11:18:51.600Z,ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:<0.765.0>:auto_failover:log_down_nodes_reason:382]Node 'ns_1@test-couchbase-k2zd9-0001.test-couchbase-k2zd9.ashwin.svc'\
             is considered down. Reason:"The data service did not respond for the duration of the auto-failover threshold. Either none of the buckets have warmed up or there is an issue with the data service. "
            

             The nodes continue to flicker in and out of contact with each other. Note that these aren't DNS issues here - these are established connections between the nodes - but it seems clear that there are network connectivity problems. (A polling sketch follows the excerpt below.)

            [ns_server:debug,2018-08-05T11:19:00.563Z,ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:<0.765.0>:auto_failover_logic:log_master_activity:170]Transitioned node {'ns_1@test-couchbase-k2zd9-0002.test-couchbase\
            -k2zd9.ashwin.svc',
                                  <<"643cfa29cffc13ff8901eb7be80cc354">>} state half_down -> up
            ...
            ns_server:debug,2018-08-05T11:19:04.563Z,ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:<0.765.0>:auto_failover:log_down_nodes_reason:382]Node 'ns_1@test-couchbase-k2zd9-0002.test-couchbase-k2zd9.ashwin.svc'\
             is considered down. Reason:"The cluster manager did not respond for the duration of the auto-failover threshold. 
            ...
            [ns_server:debug,2018-08-05T11:19:08.568Z,ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:<0.765.0>:auto_failover_logic:log_master_activity:170]Transitioned node {'ns_1@test-couchbase-k2zd9-0002.test-couchbase\
            -k2zd9.ashwin.svc',
                                  <<"643cfa29cffc13ff8901eb7be80cc354">>} state half_down -> up
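
             As an illustration (a sketch, not from the original analysis), this flapping can also be watched from outside ns_server by polling the REST API; the node names come from the logs above and the credentials are placeholders:

             # Sketch only: poll node health from the orchestrator's REST API.
             import time
             import requests

             BASE = "http://test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:8091"
             AUTH = ("Administrator", "password")  # placeholder credentials

             while True:
                 # 'status' is typically healthy/warmup/unhealthy; a node that flips between
                 # healthy and unhealthy here corresponds to the half_down -> up transitions
                 # that auto_failover_logic logs above.
                 for node in requests.get(BASE + "/pools/default", auth=AUTH).json()["nodes"]:
                     print(node["otpNode"], node["status"], node["clusterMembership"])
                 print("---")
                 time.sleep(2)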
            

             The rebalance continues to make some progress, but eventually the network flakiness causes it to fail:

            rebalance:debug,2018-08-05T11:19:39.948Z,ns_1@test-couchbase-k2zd9-0000.test-couchbase-k2zd9.ashwin.svc:<0.19239.0>:janitor_agent:bulk_set_vbucket_state:399]bulk vbucket state change failed for:
            [{'ns_1@test-couchbase-k2zd9-0002.test-couchbase-k2zd9.ashwin.svc',
                 {'EXIT',
                     {{{{{case_clause,
                             {error,
                                 {{{badmatch,
                                       {error,
                                           {{badmatch,{error,nxdomain}},
                                            [{dcp_proxy,connect,5,
                                                 [{file,"src/dcp_proxy.erl"},{line,228}]},
                                             {dcp_proxy,maybe_connect,2,
            

             Here we try to open a new connection, and the hostname test-couchbase-k2zd9-0002.test-couchbase-k2zd9.ashwin.svc can't be resolved (nxdomain) - a DNS problem.
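
             For reference (a sketch, not part of the original analysis): nxdomain is the Erlang resolver's NXDOMAIN result, and the same lookup can be checked from inside the pod with a plain getaddrinfo() call:

             # Sketch only: reproduce the failing lookup from inside a pod.
             import socket

             host = "test-couchbase-k2zd9-0002.test-couchbase-k2zd9.ashwin.svc"
             try:
                 # The port doesn't affect the lookup itself; 11210 (data service) is just an example.
                 addrs = socket.getaddrinfo(host, 11210)
                 print("resolved:", sorted({info[4][0] for info in addrs}))
             except socket.gaierror as err:
                 # getaddrinfo's "name or service not known" (EAI_NONAME) is the same
                 # condition the Erlang resolver reports as {error,nxdomain}.
                 print("lookup failed:", err)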

            These look like pretty serious DNS and networking problems. We won't be able to assume that we'll get successful rebalances until we have more solid networking.

            dfinlay Dave Finlay added a comment -

             Marking as Won't Fix since this isn't a bug; the networking is the problem.


             ashwin.govindarajulu Ashwin Govindarajulu added a comment

             Closing this since it relates to a k8s network issue.


            People

              dfinlay Dave Finlay
              ashwin.govindarajulu Ashwin Govindarajulu