  1. Couchbase Kubernetes
  2. K8S-1329

Removal of server class using NodePorts fails


Details

    Description

      Steps to Reproduce

      • Create a cluster with multiple server classes and with exposed features using NodePorts, for example:

          servers:
            - size: 1
              name: data
              services:
                - data
            - size: 1
              name: data2
              services:
                - data
        

      • Wait for the Operator to set up the cluster.
      • Remove one of the server classes (data in this example), leaving:

          servers:
            - size: 1
              name: data2
              services:
                - data
        

      • Wait for the node to be removed.

      Expectation
      The pod is removed successfully.

      Actual Behavior
      The pod is never removed and the operator hangs trying to remove the pod:

      time="2020-02-11T15:33:15Z" level=info msg="Member cb-example-0001 is no longer part of any server config, removing" cluster-name=cb-example module=cluster
      

      The operation eventually times out (about 10 minutes later) with an error like:

      time="2020-02-11T15:43:18Z" level=error msg="failed to reconcile: context deadline exceeded: Connection error - dial tcp 192.168.43.234:18091: connect: connection refused" cluster-name=cb-example module=cluster
      

      Notes
      The operator hangs in the node reachability check added in K8S-1084:

      goroutine 123 [select]:
      github.com/couchbase/couchbase-operator/pkg/util/netutil.WaitForHostPort(0x15d9020, 0xc00010bd40, 0xc0010badd0, 0x10, 0x0, 0x0)
              /home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/util/netutil/netutil.go:31 +0x19c
      github.com/couchbase/couchbase-operator/pkg/cluster.waitAlternateAddressReachable(0xc001508e20, 0x0, 0x0)
              /home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:860 +0x1a7
      github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcileMemberAlternateAddresses(0xc00049bd40, 0x0, 0x0)
              /home/couchbase/jenkins/workspace/couchbase-operator-build/goproj/src/github.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:898 +0x182
      github.com/couchbase/couchbase-operator/pkg/cluster.handleNodeServices(0xc0007263c0, 0xc00049bd40, 0x10, 0xc00047d820)
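
      For readers unfamiliar with this kind of check, the following is a minimal Go sketch of a "wait until host:port is reachable" loop. It is not the operator's netutil.WaitForHostPort implementation: the function names, the injectable dial function, the retry interval, and the error wording are all assumptions made for illustration. It shows why probing an address that nothing listens on blocks until the context deadline and then surfaces both the deadline error and the last dial error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// dialFunc attempts one connection to hostPort and reports any error.
// It is injectable so the loop can be exercised without a real network.
type dialFunc func(hostPort string) error

// tcpDial is a dialFunc that opens and immediately closes a TCP connection.
func tcpDial(hostPort string) error {
	conn, err := net.DialTimeout("tcp", hostPort, time.Second)
	if err != nil {
		return err
	}
	return conn.Close()
}

// waitForHostPort retries dial until it succeeds or ctx is done. On timeout
// it returns the context error wrapped together with the last dial error.
func waitForHostPort(ctx context.Context, hostPort string, dial dialFunc, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var lastErr error
	for {
		if lastErr = dial(hostPort); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("%w: connection error - %v", ctx.Err(), lastErr)
		case <-ticker.C:
		}
	}
}

// waitWithTimeout is a convenience wrapper that builds the deadline context.
func waitWithTimeout(hostPort string, timeout time.Duration, dial dialFunc) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return waitForHostPort(ctx, hostPort, dial, 10*time.Millisecond)
}

// errRefused stands in for a "connection refused" dial error in the demo.
var errRefused = errors.New("connect: connection refused")

func main() {
	// A dial that always fails mimics probing a port nothing listens on.
	alwaysRefused := func(string) error { return errRefused }
	err := waitWithTimeout("192.168.43.234:18091", 100*time.Millisecond, alwaysRefused)
	fmt.Println(errors.Is(err, context.DeadlineExceeded), err)
}
```

      In real use, tcpDial would be passed in place of the fake dial function. The demo uses a 100 ms deadline; the report above shows the operator giving up only after roughly 10 minutes.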
      

      The problem appears to be that we delete the reference to the node ports before the operator uses it to look up which node port to check.

      As a result, the operator checks the worker node's IP on port 18091 instead of the actual NodePort.
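
      The suspected failure mode can be sketched in Go as follows. This is an illustration only, not the operator's code: the nodePorts map, the reachabilityTarget function, and the NodePort value 31091 are hypothetical, and only the member name, worker IP, and port 18091 are taken from the logs above. It shows how a lookup that silently falls back to the pod-side port once the NodePort entry is gone ends up probing the worker node on 18091.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// adminTLSPort is the pod-side port seen in the error log above (18091).
const adminTLSPort = 18091

// nodePorts maps a member name to the NodePort allocated for it.
// Hypothetical structure; the operator's real bookkeeping may differ.
type nodePorts map[string]int

// reachabilityTarget returns the host:port a reachability check would dial.
// If the member's NodePort entry is missing, it falls back to the pod-side
// port, which is not exposed on the worker node.
func reachabilityTarget(ports nodePorts, member, workerIP string) string {
	port, ok := ports[member]
	if !ok {
		port = adminTLSPort
	}
	return net.JoinHostPort(workerIP, strconv.Itoa(port))
}

func main() {
	// 31091 is an invented NodePort value for illustration.
	ports := nodePorts{"cb-example-0001": 31091}
	fmt.Println(reachabilityTarget(ports, "cb-example-0001", "192.168.43.234"))

	// Simulate the reference being deleted before the check runs.
	delete(ports, "cb-example-0001")
	fmt.Println(reachabilityTarget(ports, "cb-example-0001", "192.168.43.234"))
}
```

      Under these assumptions, one possible fix direction is to resolve the check target before the node port reference is deleted, or to skip the reachability check for members that are being removed.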

      Workaround
      Do not use NodePorts for exposedFeatures.


            People

              simon.murray Simon Murray
              matt.carabine Matt Carabine (Inactive)