Description
What
While testing Operator 2.0.1, I noticed that an XDCR test in our sanity suite (i.e. one that should never fail) was failing. I had only changed security settings, which have nothing to do with XDCR, so this was immediately suspect.
Here's Operator 2.0.0 running against 6.5.0:
$ tco -t TestXdcrCreateCluster --server-image couchbase/server:6.5.0 -c gke_couchbase-engineering_us-east1_spjmurray -c gke_couchbase-engineering_us-west1_spjmurray -i couchbase/operator:2.0.0 -I couchbase/admission-controller:2.0.0
=== RUN TestOperator
=== RUN TestOperator/TestXdcrCreateCluster
PASS
--- PASS: TestOperator (312.74s)
--- PASS: TestOperator/TestXdcrCreateCluster (195.50s)
crd_util.go:26: creating couchbase cluster: test-couchbase-czzbs
crd_util.go:26: creating couchbase cluster: test-couchbase-724rs
test_util.go:35: Suite Test Results:
test_util.go:64: 1: TestXdcrCreateCluster...PASS
test_util.go:106:
Pass: 1.000000
Fail: 0.000000
Pass Rate: 100.000000
And against 6.5.1:
$ tco -t TestXdcrCreateCluster --server-image couchbase/server:6.5.1 -c gke_couchbase-engineering_us-east1_spjmurray -c gke_couchbase-engineering_us-west1_spjmurray -i couchbase/operator:2.0.0 -I couchbase/admission-controller:2.0.0
=== RUN TestOperator
=== RUN TestOperator/TestXdcrCreateCluster
FAIL
--- FAIL: TestOperator (891.61s)
--- FAIL: TestOperator/TestXdcrCreateCluster (781.25s)
crd_util.go:26: creating couchbase cluster: test-couchbase-f2sv5
crd_util.go:26: creating couchbase cluster: test-couchbase-vdkwf
util.go:1304: context deadline exceeded: document count 0, expected 10
util.go:1305: goroutine 531 [running]:
runtime/debug.Stack(0xc000580300, 0xc000b7dc50, 0x1)
    /usr/local/go/src/runtime/debug/stack.go:24 +0xab
github.com/couchbase/couchbase-operator/test/e2e/e2eutil.Die(0xc000580300, 0x23481e0, 0xc000434a60)
    /home/simon/go/src/github.com/couchbase/couchbase-operator/test/e2e/e2eutil/util.go:1305 +0x85
github.com/couchbase/couchbase-operator/test/e2e/e2eutil.MustVerifyDocCountInBucket(0xc000580300, 0xc000331680, 0xc000334d80, 0x20392c7, 0x7, 0xa, 0x8bb2c97000)
    /home/simon/go/src/github.com/couchbase/couchbase-operator/test/e2e/e2eutil/xdcr_util.go:120 +0xb5
github.com/couchbase/couchbase-operator/test/e2e.TestXdcrCreateCluster(0xc000580300)
    /home/simon/go/src/github.com/couchbase/couchbase-operator/test/e2e/xdcr_test.go:336 +0x784
github.com/couchbase/couchbase-operator/test/e2e/framework.RecoverDecorator.func1(0xc000580300)
    /home/simon/go/src/github.com/couchbase/couchbase-operator/test/e2e/framework/test_util.go:347 +0x85
testing.tRunner(0xc000580300, 0xc00084e5b0)
    /usr/local/go/src/testing/testing.go:909 +0x19a
created by testing.(*T).Run
    /usr/local/go/src/testing/testing.go:960 +0x652
test_util.go:35: Suite Test Results:
test_util.go:67: 1: TestXdcrCreateCluster...FAIL
test_util.go:93: Failures:
test_util.go:95: 1: TestXdcrCreateCluster
test_util.go:106:
Pass: 0.000000
Fail: 1.000000
Pass Rate: 0.000000
test_util.go:117: suite contains failures
The remote end is using IP based alternate addresses:
$ kubectl --context gke_couchbase-engineering_us-west1_spjmurray -n remote exec -ti test-couchbase-vdkwf-0000 -- curl http://localhost:8091/pools/default/nodeServices -u Administrator:password | python3 -m json.tool
{
    "rev": 39,
    "nodesExt": [
        {
            "services": {
                "mgmt": 8091,
                "mgmtSSL": 18091,
                "indexAdmin": 9100,
                "indexScan": 9101,
                "indexHttp": 9102,
                "indexStreamInit": 9103,
                "indexStreamCatchup": 9104,
                "indexStreamMaint": 9105,
                "indexHttps": 19102,
                "kv": 11210,
                "kvSSL": 11207,
                "capi": 8092,
                "capiSSL": 18092,
                "projector": 9999,
                "n1ql": 8093,
                "n1qlSSL": 18093
            },
            "thisNode": true,
            "hostname": "test-couchbase-vdkwf-0000.test-couchbase-vdkwf.remote.svc",
            "alternateAddresses": {
                "external": {
                    "hostname": "10.16.0.30",
                    "ports": {
                        "mgmt": 31671,
                        "mgmtSSL": 32548,
                        "kv": 31796,
                        "kvSSL": 31968,
                        "capi": 31383,
                        "capiSSL": 31979
                    }
                }
            }
        },
        {
            "services": {
                "mgmt": 8091,
                "mgmtSSL": 18091,
                "indexAdmin": 9100,
                "indexScan": 9101,
                "indexHttp": 9102,
                "indexStreamInit": 9103,
                "indexStreamCatchup": 9104,
                "indexStreamMaint": 9105,
                "indexHttps": 19102,
                "kv": 11210,
                "kvSSL": 11207,
                "capi": 8092,
                "capiSSL": 18092,
                "projector": 9999,
                "n1ql": 8093,
                "n1qlSSL": 18093
            },
            "hostname": "test-couchbase-vdkwf-0001.test-couchbase-vdkwf.remote.svc",
            "alternateAddresses": {
                "external": {
                    "hostname": "10.16.0.34",
                    "ports": {
                        "mgmt": 31615,
                        "mgmtSSL": 31177,
                        "kv": 32342,
                        "kvSSL": 31076,
                        "capi": 32325,
                        "capiSSL": 32739
                    }
                }
            }
        },
        {
            "services": {
                "mgmt": 8091,
                "mgmtSSL": 18091,
                "indexAdmin": 9100,
                "indexScan": 9101,
                "indexHttp": 9102,
                "indexStreamInit": 9103,
                "indexStreamCatchup": 9104,
                "indexStreamMaint": 9105,
                "indexHttps": 19102,
                "kv": 11210,
                "kvSSL": 11207,
                "capi": 8092,
                "capiSSL": 18092,
                "projector": 9999,
                "n1ql": 8093,
                "n1qlSSL": 18093
            },
            "hostname": "test-couchbase-vdkwf-0002.test-couchbase-vdkwf.remote.svc",
            "alternateAddresses": {
                "external": {
                    "hostname": "10.16.0.36",
                    "ports": {
                        "mgmt": 31020,
                        "mgmtSSL": 31086,
                        "kv": 31648,
                        "kvSSL": 31784,
                        "capi": 32130,
                        "capiSSL": 31562
                    }
                }
            }
        }
    ],
    "clusterCapabilitiesVer": [
        1,
        0
    ],
    "clusterCapabilities": {
        "n1ql": [
            "enhancedPreparedStatements"
        ]
    }
}
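The same check can be done mechanically rather than by eye. The following is a minimal sketch, assuming only the `nodeServices` JSON shape shown in this report (`nodesExt[].alternateAddresses.external.hostname`). The function name `classify_alternate_addresses` is hypothetical and is not part of the Operator, the test framework, or Couchbase Server.

```python
import ipaddress
import json


def classify_alternate_addresses(node_services: dict) -> dict:
    """Map each node's internal hostname to ("ip" | "dns", external hostname).

    Hypothetical helper: it relies only on the JSON layout shown in this report.
    """
    result = {}
    for node in node_services.get("nodesExt", []):
        external = node.get("alternateAddresses", {}).get("external")
        if external is None:
            continue  # node exposes no external alternate address
        host = external["hostname"]
        try:
            ipaddress.ip_address(host)  # raises ValueError for non-IP strings
            kind = "ip"
        except ValueError:
            kind = "dns"
        result[node.get("hostname", "<unknown>")] = (kind, host)
    return result


# Trimmed sample: the first node from the nodeServices output in this report.
sample = json.loads("""
{
    "nodesExt": [
        {
            "hostname": "test-couchbase-vdkwf-0000.test-couchbase-vdkwf.remote.svc",
            "alternateAddresses": {
                "external": {
                    "hostname": "10.16.0.30",
                    "ports": {"mgmt": 31671, "kv": 31796}
                }
            }
        }
    ]
}
""")

for internal, (kind, external) in classify_alternate_addresses(sample).items():
    print(f"{internal}: {kind} ({external})")
```

Fed the full output, all three remote nodes (10.16.0.30, 10.16.0.34, 10.16.0.36) classify as `ip`.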
However, the UI indicates that XDCR is attempting to use DNS based addresses.
Why is this a Problem?
We strongly discourage the use of IP based alternate addressing, as DNS based addressing is far superior and, thankfully, still works. In reality, however, the vast majority of our customers use Red Hat OpenShift, which uses OVS as its networking layer (i.e. an overlay with DNAT), and that forces the use of IP based alternate addressing.
The big risk here is that anyone doing an upgrade will find their XDCR connections stop working and that they are unable to roll back.
Setup
- X.Y.default.svc are the XDCR source
  - Establishes XDCR using an IP based "node port" URL
- X.Y.remote.svc are the XDCR target
  - Has IP based alternate addresses exposed
- Logs coming in a follow up as I don't trust the Mrs' internet connection...
Issue Links
- backports to
  - MB-39091 [BP 6.6] - Alternate IP Based XDCR Appears Broken (Closed)
  - MB-39687 [BP 6.6] - Alternate IP Based XDCR Appears Broken (Closed)
- is triggered by
  - MB-37761 [BP 6.5.1] - XDCR does not apply the correct alternate address heuristic (Closed)
- relates to
  - K8S-1451 Don't Load-balance Alternate Addresses (Closed)
  - MB-37684 XDCR Remote Cluster is not Idempotent due to lack of DNS SRV support (Closed)
  - DOC-6656 Release note - alternate IP based XDCR not working with Operator 2.0 and Server 6.5.1, requiring them to downgrade to 6.5.0 (Resolved)