Description
Over the weekend we got a timeout:
04:23:08.045 util.go:692: 2019-03-08 02:08:29.872816221 -0800 PST m=+1871.512206167 Cluster healthy
04:23:08.045 util.go:1136: context deadline exceeded: context deadline exceeded: error upgrading connection: pods "test-couchbase-zjfgw-0006" not found
04:23:08.045 util.go:1137: goroutine 2659 [running]:
04:23:08.045 runtime/debug.Stack(0xc000aea000, 0xc0009d77b0, 0x1)
04:23:08.045 /jenkins/workspace/operator-gke-p0/go/src/runtime/debug/stack.go:24 +0xb5
04:23:08.045 github.com/couchbase/couchbase-operator/test/e2e/e2eutil.Die(0xc000aea000, 0x1e0d840, 0xc00035c510)
04:23:08.045 /jenkins/workspace/operator-gke-p0/gopath/src/github.com/couchbase/couchbase-operator/test/e2e/e2eutil/util.go:1137 +0x88
04:23:08.045 github.com/couchbase/couchbase-operator/test/e2e/e2eutil.MustVerifyServices(0xc000aea000, 0xc00019cd20, 0xc0002c3000, 0xdf8475800, 0xc0006dc150, 0xc000676390, 0x1, 0x1)
04:23:08.045 /jenkins/workspace/operator-gke-p0/gopath/src/github.com/couchbase/couchbase-operator/test/e2e/e2eutil/couchbase_util.go:665 +0xc5
04:23:08.045 github.com/couchbase/couchbase-operator/test/e2e.TestSwapNodesBetweenServices(0xc000aea000)
04:23:08.045 /jenkins/workspace/operator-gke-p0/gopath/src/github.com/couchbase/couchbase-operator/test/e2e/cluster_test.go:746 +0x1fb5
04:23:08.045 github.com/couchbase/couchbase-operator/test/e2e/framework.RecoverDecorator.func1(0xc000aea000)
04:23:08.045 /jenkins/workspace/operator-gke-p0/gopath/src/github.com/couchbase/couchbase-operator/test/e2e/framework/test_util.go:517 +0x7b
04:23:08.045 testing.tRunner(0xc000aea000, 0xc000adf970)
04:23:08.045 /jenkins/workspace/operator-gke-p0/go/src/testing/testing.go:827 +0x163
04:23:08.045 created by testing.(*T).Run
04:23:08.045 /jenkins/workspace/operator-gke-p0/go/src/testing/testing.go:878 +0x651
|
The operator log shows pod 6 being rebalanced out:
time="2019-03-08T10:08:40Z" level=info msg="Creating a pod (test-couchbase-zjfgw-0007) running Couchbase enterprise-5.5.3" cluster-name=test-couchbase-zjfgw module=cluster
time="2019-03-08T10:08:58Z" level=info msg="added member (test-couchbase-zjfgw-0007)" cluster-name=test-couchbase-zjfgw module=cluster
time="2019-03-08T10:08:59Z" level=info msg="Rebalance progress: 0.000000" cluster-name=test-couchbase-zjfgw module=cluster
time="2019-03-08T10:09:03Z" level=info msg="Rebalance progress: 75.000000" cluster-name=test-couchbase-zjfgw module=cluster
time="2019-03-08T10:09:07Z" level=info msg="Rebalance progress: 75.000000" cluster-name=test-couchbase-zjfgw module=cluster
time="2019-03-08T10:09:11Z" level=info msg="Rebalance progress: 75.000000" cluster-name=test-couchbase-zjfgw module=cluster
time="2019-03-08T10:09:15Z" level=info msg="Rebalance progress: 75.000000" cluster-name=test-couchbase-zjfgw module=cluster
time="2019-03-08T10:09:19Z" level=info msg="Rebalance progress: 75.000000" cluster-name=test-couchbase-zjfgw module=cluster
time="2019-03-08T10:09:24Z" level=info msg="deleted pod (test-couchbase-zjfgw-0006)" cluster-name=test-couchbase-zjfgw module=cluster
time="2019-03-08T10:09:24Z" level=info msg="reconcile finished" cluster-name=test-couchbase-zjfgw module=cluster
|
In theory this race occurs because the port-forward upgrade can take longer than the 1-minute timeout period. This change proactively adds an aggressive 10s round-trip timeout to the port forwarder, so we can rotate the client a few times within the overall timeout period, by which point pod 6 will almost certainly no longer be there to be used.