Couchbase Kubernetes / K8S-889

Rebalance exited with reason {mover_crashed, {unexpected_exit, {'EXIT',<0.19885.4>,


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • None
    • None
    • openshift
    • Environment: CB operator 1.1 running on Openshift 4.0 in AWS

    Description

      Setup: CB operator 1.1 running on OpenShift 4.0 in AWS.

      The configuration is attached as a screenshot.

      9 Data nodes

      3 Analytics node + (Index+Query) Nodes

      3 Index + Query Nodes

       

      Deleting a data pod caused the rebalance to fail with this error:

       

      Rebalance exited with reason {mover_crashed, {unexpected_exit, {'EXIT',<0.19885.4>, {bulk_set_vbucket_state_failed, [{'ns_1@cb-k8s-tls-demo-0015.cb-k8s-tls-demo.default.svc', {'EXIT', {{{{{case_clause, {error, {{{badmatch, {error, {{badmatch,{error,nxdomain}}, [{dcp_proxy,connect,5, [{file,"src/dcp_proxy.erl"}, {line,228}]}, {dcp_proxy,maybe_connect,2, [{file,"src/dcp_proxy.erl"}, {line,210}]}, {dcp_consumer_conn,init,2, [{file, "src/dcp_consumer_conn.erl"}, {line,57}]}, {dcp_proxy,init,1, [{file,"src/dcp_proxy.erl"}, {line,57}]}, {gen_server,init_it,6, [{file,"gen_server.erl"}, {line,304}]}, {proc_lib,init_p_do_apply,3, [{file,"proc_lib.erl"}, {line,239}]}]}}}, [{dcp_replicator,init,1, [{file,"src/dcp_replicator.erl"}, {line,48}]}, {gen_server,init_it,6, [{file,"gen_server.erl"}, {line,304}]}, {proc_lib,init_p_do_apply,3, [{file,"proc_lib.erl"}, {line,239}]}]}, {child,undefined, {'ns_1@cb-k8s-tls-demo-0003.cb-k8s-tls-demo.default.svc', [del_times,snappy,xattr]}, {dcp_replicator,start_link, ['ns_1@cb-k8s-tls-demo-0003.cb-k8s-tls-demo.default.svc', "tweets", [del_times,snappy,xattr]]}, temporary,60000,worker, [dcp_replicator]}}}}, [{dcp_sup,start_replicator,2, [{file,"src/dcp_sup.erl"}, {line,57}]}, {dcp_sup, '-manage_replicators/2-lc$^3/1-3-', 2, [{file,"src/dcp_sup.erl"}, {line,94}]}, {dcp_replication_manager, handle_call,3, [{file, "src/dcp_replication_manager.erl"}, {line,89}]}, {gen_server,handle_msg,5, [{file,"gen_server.erl"}, {line,585}]}, {proc_lib,init_p_do_apply,3, [{file,"proc_lib.erl"}, {line,239}]}]}, {gen_server,call, ['dcp_replication_manager-tweets', {manage_replicators, ['ns_1@cb-k8s-tls-demo-0003.cb-k8s-tls-demo.default.svc', 'ns_1@cb-k8s-tls-demo-0004.cb-k8s-tls-demo.default.svc', 'ns_1@cb-k8s-tls-demo-0006.cb-k8s-tls-demo.default.svc', 'ns_1@cb-k8s-tls-demo-0008.cb-k8s-tls-demo.default.svc']}, infinity]}}, {gen_server,call, ['replication_manager-tweets', {change_vbucket_replication,454, 'ns_1@cb-k8s-tls-demo-0003.cb-k8s-tls-demo.default.svc'}, infinity]}}, {gen_server,call, [{'janitor_agent-tweets', 'ns_1@cb-k8s-tls-demo-0015.cb-k8s-tls-demo.default.svc'}, {if_rebalance,<0.19800.4>, {update_vbucket_state,454,replica, undefined, 'ns_1@cb-k8s-tls-demo-0003.cb-k8s-tls-demo.default.svc'}}, infinity]}}}}]}}}}
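The term above is hard to read by eye, so a small triage helper may be useful. The following is only a sketch: the function name `summarize_rebalance_error` and both regular expressions are hypothetical and assume that node names appear as quoted `'ns_1@<hostname>'` atoms and that the root cause appears as a leaf tuple such as `{error,nxdomain}`, as they do in this report.

```python
import re

# Assumption: Erlang node names are quoted atoms of the form 'ns_1@<hostname>'.
NODE_RE = re.compile(r"'ns_1@([^']+)'")
# Assumption: the root cause is a leaf tuple like {error,nxdomain}.
# Wrappers such as {error, {{{badmatch, ...}}} do not match because the
# second element must be a bare lowercase atom.
LEAF_ERROR_RE = re.compile(r"\{error,\s*([a-z_]+)\}")


def summarize_rebalance_error(text):
    """Return the distinct hostnames and leaf error atoms found in the term."""
    hosts = sorted(set(NODE_RE.findall(text)))
    reasons = sorted(set(LEAF_ERROR_RE.findall(text)))
    return {"hosts": hosts, "reasons": reasons}


if __name__ == "__main__":
    # Short excerpt of the error reported in this issue.
    excerpt = (
        "{{badmatch,{error,nxdomain}}, [{dcp_proxy,connect,5, "
        '[{file,"src/dcp_proxy.erl"}, {line,228}]}]}, '
        "{child,undefined, "
        "{'ns_1@cb-k8s-tls-demo-0003.cb-k8s-tls-demo.default.svc', "
        "[del_times,snappy,xattr]}}"
    )
    print(summarize_rebalance_error(excerpt))
```

Run against the full term, a helper like this would list every node hostname mentioned alongside the leaf error atoms, which makes it quicker to see that the failure bottoms out in a name-resolution error rather than in the data service itself.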

       

       

       


        Activity

          ram.dhakne Ram Dhakne (Inactive) added a comment -

          Assigning to Simon Murray to take a look from the k8s angle.
          simon.murray Simon Murray added a comment -

          I'm in total agreement: this is purely a DNS issue and has nothing to do with server or the operator.

          I have in the past asked for server to do a couple of retries in case it is a transient error, but I got a no.

          That said, while server will not retry, we will. So does the cluster get into a state of being consistently broken, or does it eventually become healthy again? If it's the latter, everything is working as designed.
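To illustrate the kind of retry discussed in this comment, here is a minimal sketch of retrying a hostname lookup with exponential backoff. It is not the operator's or the server's code: the function `resolve_with_retry` and its parameters are hypothetical, and the only real API it relies on is Python's `socket.getaddrinfo`, which raises `socket.gaierror` when a name cannot be resolved.

```python
import socket
import time


def resolve_with_retry(host, attempts=3, delay=1.0,
                       resolve=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host, retrying on socket.gaierror with exponential backoff.

    The resolve and sleep arguments are injectable so the retry logic can
    be exercised without real DNS lookups or real waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            # Port None asks only for address resolution.
            return resolve(host, None)
        except socket.gaierror as err:
            last_error = err
            # Back off before the next try, but not after the final one.
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))
    raise last_error


if __name__ == "__main__":
    print(resolve_with_retry("localhost"))
```

With this shape, a lookup that fails briefly while a replacement pod's DNS record propagates succeeds on a later attempt, while a name that never resolves still surfaces the original error after the last attempt.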

          ram.dhakne Ram Dhakne (Inactive) added a comment -

          If it's a transient error and we are handling it, then why is the UI reporting that the rebalance failed?

          A rebalance failure is treated very seriously and freaks out DBAs. The rebalance still ran and finished successfully.

          ingenthr Matt Ingenthron added a comment -

          Ram Dhakne: Historically, much of the cluster manager expects the environment to be stable before carrying out something like a rebalance. Not just with K8s, but in general, we've had concerns when failures happen transiently, and there is an improvement planned for Mad Hatter. Let me see what more can be done in the near term.
          simon.murray Simon Murray added a comment -

          Platform issue


          People

            ram.dhakne Ram Dhakne (Inactive)
            ram.dhakne Ram Dhakne (Inactive)