  1. Couchbase Kubernetes
  2. K8S-615

Ephemeral pod with log PV: multiple rebalance failures when server pod is killed


Details

    Description

      Operator image used: 1.1.0-108

      Scenario:

      1. Create a 3-member Couchbase cluster with a log PV defined for all pods
      2. Kill Couchbase server pod 0001 using `kubectl delete pods/cb-example-0001`

      Observation:

      New pod 0003 was created to replace pod 0001, but before pod 0001 was removed the cluster went through multiple rebalance failures.

      Operator console prints:

      time="2018-10-05T06:34:39Z" level=warning msg="cb-example-0001 is unrecoverable: No volume mounts defined" cluster-name=cb-example module=cluster
      time="2018-10-05T06:34:39Z" level=info msg="planning removal of http://cb-example-0001.cb-example.ashwin.svc:8091" cluster-name=cb-example module=cluster
      time="2018-10-05T06:34:41Z" level=warning msg="unable to poll external addresses for pod cb-example-0001" cluster-name=cb-example module=cluster
      time="2018-10-05T06:34:42Z" level=info msg="Rebalance progress: 0.000000" cluster-name=cb-example module=cluster
      time="2018-10-05T06:34:46Z" level=info msg="Rebalance progress: 0.000000" cluster-name=cb-example module=cluster
      time="2018-10-05T06:34:54Z" level=error msg="failed to reconcile: Failed to rebalance: cluster reports rebalance incomplete" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:02Z" level=info msg="server config all_services: cb-example-0000,cb-example-0002" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:02Z" level=info msg="Cluster status: unbalanced" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:02Z" level=info msg="Node status:" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:02Z" level=info msg="┌─────────────────┬──────────────┬─────────────────────┐" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:02Z" level=info msg="│ Server          │ Class        │ Status              │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:02Z" level=info msg="├─────────────────┼──────────────┼─────────────────────┤" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:02Z" level=info msg="│ cb-example-0000 │ all_services │ managed+active      │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:02Z" level=info msg="│ cb-example-0001 │ all_services │ managed+failed      │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:02Z" level=info msg="│ cb-example-0002 │ all_services │ managed+active      │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:02Z" level=info msg="│ cb-example-0003 │ all_services │ managed+pending_add │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:02Z" level=info msg="└─────────────────┴──────────────┴─────────────────────┘" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:02Z" level=info cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:04Z" level=info msg="An auto-failover has taken place" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:04Z" level=warning msg="cb-example-0001 is unrecoverable: No volume mounts defined" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:04Z" level=info msg="planning removal of http://cb-example-0001.cb-example.ashwin.svc:8091" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:06Z" level=warning msg="unable to poll external addresses for pod cb-example-0001" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:07Z" level=info msg="Rebalance progress: 0.000000" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:11Z" level=info msg="Rebalance progress: 0.000000" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:19Z" level=error msg="failed to reconcile: Failed to rebalance: cluster reports rebalance incomplete" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:27Z" level=info msg="server config all_services: cb-example-0000,cb-example-0002" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:27Z" level=info msg="Cluster status: unbalanced" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:27Z" level=info msg="Node status:" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:27Z" level=info msg="┌─────────────────┬──────────────┬─────────────────────┐" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:27Z" level=info msg="│ Server          │ Class        │ Status              │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:27Z" level=info msg="├─────────────────┼──────────────┼─────────────────────┤" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:27Z" level=info msg="│ cb-example-0000 │ all_services │ managed+active      │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:27Z" level=info msg="│ cb-example-0001 │ all_services │ managed+failed      │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:27Z" level=info msg="│ cb-example-0002 │ all_services │ managed+active      │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:27Z" level=info msg="│ cb-example-0003 │ all_services │ managed+pending_add │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:27Z" level=info msg="└─────────────────┴──────────────┴─────────────────────┘" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:27Z" level=info cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:29Z" level=info msg="An auto-failover has taken place" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:29Z" level=warning msg="cb-example-0001 is unrecoverable: No volume mounts defined" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:29Z" level=info msg="planning removal of http://cb-example-0001.cb-example.ashwin.svc:8091" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:32Z" level=warning msg="unable to poll external addresses for pod cb-example-0001" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:33Z" level=info msg="Rebalance progress: 0.000000" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:37Z" level=info msg="Rebalance progress: 0.000000" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:45Z" level=error msg="failed to reconcile: Failed to rebalance: cluster reports rebalance incomplete" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:53Z" level=info msg="server config all_services: cb-example-0000,cb-example-0002" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:53Z" level=info msg="Cluster status: unbalanced" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:53Z" level=info msg="Node status:" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:53Z" level=info msg="┌─────────────────┬──────────────┬─────────────────────┐" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:53Z" level=info msg="│ Server          │ Class        │ Status              │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:53Z" level=info msg="├─────────────────┼──────────────┼─────────────────────┤" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:53Z" level=info msg="│ cb-example-0000 │ all_services │ managed+active      │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:53Z" level=info msg="│ cb-example-0001 │ all_services │ managed+failed      │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:53Z" level=info msg="│ cb-example-0002 │ all_services │ managed+active      │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:53Z" level=info msg="│ cb-example-0003 │ all_services │ managed+pending_add │" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:53Z" level=info msg="└─────────────────┴──────────────┴─────────────────────┘" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:53Z" level=info cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:55Z" level=info msg="An auto-failover has taken place" cluster-name=cb-example module=cluster
      time="2018-10-05T06:35:55Z" level=warning msg="cb-example-0001 is unrecoverable: No volume mounts defined" cluster-name=cb-example module=cluster
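
The operator log above repeats the same cycle: plan removal of cb-example-0001, start a rebalance that stays at 0.000000, then fail to reconcile. A minimal Python sketch for counting those failures and measuring the time between them is shown here. The regular expression assumes the `time=... level=... msg=...` layout seen in the excerpt, and `summarize` and `SAMPLE` are hypothetical names used only for illustration, not part of the operator.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the excerpt: time="..." level=... [msg="..."]
LINE_RE = re.compile(
    r'time="(?P<time>[^"]+)" level=(?P<level>\w+)(?: msg="(?P<msg>[^"]*)")?'
)

# Three error lines copied from the operator log, plus one info line.
SAMPLE = '''
time="2018-10-05T06:34:46Z" level=info msg="Rebalance progress: 0.000000" cluster-name=cb-example module=cluster
time="2018-10-05T06:34:54Z" level=error msg="failed to reconcile: Failed to rebalance: cluster reports rebalance incomplete" cluster-name=cb-example module=cluster
time="2018-10-05T06:35:19Z" level=error msg="failed to reconcile: Failed to rebalance: cluster reports rebalance incomplete" cluster-name=cb-example module=cluster
time="2018-10-05T06:35:45Z" level=error msg="failed to reconcile: Failed to rebalance: cluster reports rebalance incomplete" cluster-name=cb-example module=cluster
'''


def summarize(log_text):
    """Return (failure timestamps, gaps in seconds between consecutive failures)."""
    failures = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match or match.group("level") != "error":
            continue
        message = match.group("msg") or ""  # some lines carry no msg field
        if "Failed to rebalance" in message:
            failures.append(
                datetime.strptime(match.group("time"), "%Y-%m-%dT%H:%M:%SZ")
            )
    gaps = [(b - a).total_seconds() for a, b in zip(failures, failures[1:])]
    return failures, gaps


if __name__ == "__main__":
    failures, gaps = summarize(SAMPLE)
    print(len(failures), "rebalance failures, gaps in seconds:", gaps)
```

Run against the full operator log, the same function gives a quick count of how many reconcile attempts failed before the pod was finally removed.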
      


          Activity

            simon.murray Simon Murray added a comment -

            Looks like it's not us as per usual

            [ns_server:error,2018-10-05T06:34:23.281Z,ns_1@cb-example-0000.cb-example.ashwin.svc:service_agent-cbas<0.25094.0>:service_agent:handle_info:239]Lost json rpc connection for service cbas, reason shutdown. Terminating.
            [ns_server:error,2018-10-05T06:34:23.281Z,ns_1@cb-example-0000.cb-example.ashwin.svc:service_agent-cbas<0.25094.0>:service_agent:terminate:260]Terminating abnormally
            [ns_server:error,2018-10-05T06:34:23.282Z,ns_1@cb-example-0000.cb-example.ashwin.svc:service_rebalancer-cbas<0.25109.0>:service_rebalancer:run_rebalance:82]Agent terminated during the rebalance: {'DOWN',#Ref<0.0.0.146596>,process,
                                                    <0.25094.0>,
                                                    {lost_connection,shutdown}}
            [ns_server:error,2018-10-05T06:34:23.284Z,ns_1@cb-example-0000.cb-example.ashwin.svc:service_agent-cbas<0.25490.0>:service_agent:handle_call:182]Got rebalance-only call {if_rebalance,<0.25109.0>,unset_rebalancer} that doesn't match rebalancer pid undefined
            [ns_server:error,2018-10-05T06:34:23.292Z,ns_1@cb-example-0000.cb-example.ashwin.svc:service_rebalancer-cbas<0.25109.0>:service_agent:process_bad_results:810]Service call unset_rebalancer (service cbas) failed on some nodes:
            [{'ns_1@cb-example-0000.cb-example.ashwin.svc',nack}]
            [ns_server:warn,2018-10-05T06:34:23.292Z,ns_1@cb-example-0000.cb-example.ashwin.svc:service_rebalancer-cbas<0.25109.0>:service_rebalancer:run_rebalance:91]Failed to unset rebalancer on some nodes:
            {error,{bad_nodes,cbas,unset_rebalancer,
                              [{'ns_1@cb-example-0000.cb-example.ashwin.svc',nack}]}}
            [user:error,2018-10-05T06:34:23.298Z,ns_1@cb-example-0000.cb-example.ashwin.svc:<0.710.0>:ns_orchestrator:do_log_rebalance_completion:1117]Rebalance exited with reason {badmatch,failed}

            In future, can you please review the logs first and check whether the fault is obviously ours or the server's? It only takes a couple of minutes.
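
The log review suggested above can be partly scripted. The following Python sketch splits ns_server entries like the ones quoted in this comment into fields, so the reporting module and service are visible at a glance. The field layout (`[category:level,timestamp,node:process:module:function:line]message`) is inferred from the quoted lines only, not from ns_server documentation, and `parse_entries` and `NS_SAMPLE` are hypothetical names.

```python
import re

# Header layout inferred from the quoted ns_server lines (an assumption).
ENTRY_RE = re.compile(
    r"^\[(?P<category>\w+):(?P<level>\w+),(?P<time>[^,]+),"
    r"(?P<node>[^:]+):(?P<process>[^:]*):"
    r"(?P<module>\w+):(?P<function>\w+):(?P<line>\d+)\](?P<message>.*)$"
)

# Two single-line entries copied from the comment above.
NS_SAMPLE = """\
[ns_server:error,2018-10-05T06:34:23.281Z,ns_1@cb-example-0000.cb-example.ashwin.svc:service_agent-cbas<0.25094.0>:service_agent:handle_info:239]Lost json rpc connection for service cbas, reason shutdown. Terminating.
[user:error,2018-10-05T06:34:23.298Z,ns_1@cb-example-0000.cb-example.ashwin.svc:<0.710.0>:ns_orchestrator:do_log_rebalance_completion:1117]Rebalance exited with reason {badmatch,failed}
"""


def parse_entries(log_text, level="error"):
    """Return a dict per matching entry; continuation lines are skipped."""
    entries = []
    for raw in log_text.splitlines():
        match = ENTRY_RE.match(raw.strip())
        if match and match.group("level") == level:
            entries.append(match.groupdict())
    return entries


if __name__ == "__main__":
    for entry in parse_entries(NS_SAMPLE):
        print(entry["time"], entry["module"], entry["function"], entry["message"])
```

In this excerpt the first error comes from `service_agent` for the cbas service, ahead of the `ns_orchestrator` entry that reports the rebalance exit.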

             


            mikew Mike Wiederhold [X] (Inactive) added a comment -

            Closing this issue since it is a server issue which is fixed for Couchbase 6.0.

            People

              mikew Mike Wiederhold [X] (Inactive)
              ashwin.govindarajulu Ashwin Govindarajulu