Couchbase Kubernetes / K8S-421

Ability to Balance in Nodes on Provision Failure


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: not-targeted
    • Component/s: kubernetes
    • Labels:

      Description

      • Set anti-affinity to true in couchbase-cluster.yaml.
      • The cluster has one master node and two worker nodes.

      [root@ip-172-31-7-110 couchbase-operator]# kubectl get nodes
      NAME                                         STATUS    ROLES     AGE       VERSION
      ip-172-31-1-197.us-east-2.compute.internal   Ready     <none>    57d       v1.10.2
      ip-172-31-6-25.us-east-2.compute.internal    Ready     <none>    57d       v1.10.2
      ip-172-31-7-110.us-east-2.compute.internal   Ready     master    57d       v1.10.2

      • There are three pods to be scheduled, as specified in the servers section of couchbase-cluster.yaml (a sketch of the surrounding spec follows this snippet):

      servers:
          - size: 3
            name: all_services
            services:
              - data
              - index
              - query
              - search
              - eventing
              - analytics
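
      For context, here is a minimal sketch of how the anti-affinity flag and the server class above fit together in the CouchbaseCluster resource. Field names follow the v1 CouchbaseCluster spec as we understand it; values such as the image version and cluster name are illustrative and not taken from this ticket:

      apiVersion: couchbase.com/v1
      kind: CouchbaseCluster
      metadata:
        name: cb-example
      spec:
        baseImage: couchbase/server
        version: enterprise-5.5.0
        # With anti-affinity enabled, at most one Couchbase pod is scheduled per
        # Kubernetes node, so size: 3 requires at least three schedulable workers.
        antiAffinity: true
        servers:
          - size: 3
            name: all_services
            services:
              - data
              # ... remaining services as listed above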

      • Since anti-affinity is set to true, there is no worker node left for the third pod, and it fails to be scheduled, as expected. The operator logs also print messages indicating this behavior:

      time="2018-06-29T05:24:57Z" level=info msg="Finish reconciling" cluster-name=cb-example module=cluster
      time="2018-06-29T05:24:57Z" level=error msg="failed to reconcile: Failed to add new node to cluster: unable to schedule pod: 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules." cluster-name=cb-example module=cluster

      • The first two pods are scheduled properly and are up and running:

      [root@ip-172-31-7-110 couchbase-operator]# kubectl get pods -o wide
      NAME                                  READY     STATUS    RESTARTS   AGE       IP          NODE
      cb-example-0000                       1/1       Running   0          12m       10.44.0.5   ip-172-31-6-25.us-east-2.compute.internal
      cb-example-0001                       1/1       Running   0          12m       10.36.0.2   ip-172-31-1-197.us-east-2.compute.internal
      couchbase-operator-5d7dfb795f-wthfr   1/1       Running   0          2d        10.36.0.1   ip-172-31-1-197.us-east-2.compute.internal

      • However, after logging into the Couchbase Web Console, it appears that the operator left the cluster in an inconsistent state (a rebalance is pending), as shown in the attached screenshot.
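
      The same pending-rebalance state can also be confirmed without the UI by querying Couchbase Server from one of the running pods. This is an illustrative sketch: the admin credentials and port are placeholders, and it assumes couchbase-cli is on the PATH inside the server container:

      # Lists cluster nodes; a node reported as inactiveAdded has been added to the
      # cluster but not yet rebalanced in.
      kubectl exec cb-example-0000 -- couchbase-cli server-list \
        -c 127.0.0.1:8091 -u Administrator -p password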

      Question: Since two of the three pods are scheduled successfully given node availability, isn't the operator expected to manage those two pods correctly? Is the cluster expected to be left in this state?

        Attachments


          Activity

          Simon Murray added a comment:

          We need to be very careful here. Say we were to rename a server class, the second node add failed, and we continued with the rebalance and ejection: we'd be down to one node and in a world of pain.

          Given this potential for failure, we should probably think very hard about this, and certainly defer it to >1.0.0.

          Simon Murray added a comment:

          Moving to the backlog as it's entirely too risky at the moment.

          Lynn Straus added a comment:

          Per review in the July 24 K8s meeting:

          1. Simon Murray to consider adding a message asking the customer to refer to the log file, if warranted. This may be a separate minor enhancement ticket; likely post-1.0.

          2. Simon Murray and Eric Schneider to look at the documentation around scaling to see whether it should be enhanced and/or should ask customers to refer to the log files.

          3. Added the releasenote label to this ticket.

          4. Future enhancement consideration: improve customer messaging regarding system status. This is post-1.0; either use this ticket or clone it to track the enhancement.

          Mike Wiederhold (Inactive) added a comment:

          Description for release notes:

          Known Issue: If a cluster is scaling up and not enough Kubernetes nodes are present, the operator will not rebalance the cluster, even if some nodes can be added. Users should always ensure that sufficient resources are present in their Kubernetes cluster before scaling a Couchbase cluster.

          Workaround: None.

          Eric Schneider (Inactive) added a comment:

          Description for release notes:

          Summary: Known issue: if a cluster is scaled up when not enough Kubernetes nodes are present, the Operator will not rebalance the cluster, even if some nodes can be added.

          Workaround: You should ensure that sufficient resources are present in the Kubernetes cluster before scaling a Couchbase cluster.
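
          As a practical illustration of that workaround (not an official procedure): with anti-affinity enabled, the number of Ready, schedulable worker nodes must be at least the requested server size before scaling. The grep filters below assume the ROLES column distinguishes the master, as in the node listing above:

          # Count Ready worker nodes (excludes the tainted master); the result must be
          # >= servers[].size before increasing it.
          kubectl get nodes --no-headers | grep -vw master | grep -cw Ready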

          Simon Murray added a comment:

          I'm calling this not a bug, but possibly a feature request to allow a half-working cluster.


            People

            Assignee:
            Simon Murray
            Reporter:
            Sindhura Palakodety (Inactive)
            Votes:
            0
            Watchers:
            5

              Dates

              Created:
              Updated:

                Gerrit Reviews

                There are no open Gerrit changes
