  1. Couchbase Server
  2. MB-38995

Alternate IP Based XDCR Appears Broken


Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 6.5.1
    • 7.0.0
    • XDCR
    • None
    • Kubernetes (any), CAO 2.0.0
    • Untriaged
    • Ubuntu 64-bit
    • 1
    • Unknown

    Description

      What

      While testing Operator 2.0.1, I noticed that XDCR in our sanity suite (i.e. tests that should never fail) was throwing an error.  I had only changed security settings, which have nothing to do with XDCR, so this was immediately suspect.

      Here's Operator 2.0.0 running against 6.5.0:

       

      $ tco -t TestXdcrCreateCluster --server-image couchbase/server:6.5.0 -c gke_couchbase-engineering_us-east1_spjmurray -c gke_couchbase-engineering_us-west1_spjmurray -i couchbase/operator:2.0.0 -I couchbase/admission-controller:2.0.0
      === RUN   TestOperator
      === RUN   TestOperator/TestXdcrCreateCluster
      PASS
      --- PASS: TestOperator (312.74s)
          --- PASS: TestOperator/TestXdcrCreateCluster (195.50s)
              crd_util.go:26: creating couchbase cluster: test-couchbase-czzbs
              crd_util.go:26: creating couchbase cluster: test-couchbase-724rs
          test_util.go:35: Suite Test Results: 
          test_util.go:64: 1: TestXdcrCreateCluster...PASS
          test_util.go:106: 
               Pass: 1.000000 
               Fail: 0.000000 
               Pass Rate: 100.000000

      and against 6.5.1:

      tco -t TestXdcrCreateCluster --server-image couchbase/server:6.5.1 -c gke_couchbase-engineering_us-east1_spjmurray -c gke_couchbase-engineering_us-west1_spjmurray -i couchbase/operator:2.0.0 -I couchbase/admission-controller:2.0.0
      === RUN   TestOperator
      === RUN   TestOperator/TestXdcrCreateCluster
      FAIL
      --- FAIL: TestOperator (891.61s)
          --- FAIL: TestOperator/TestXdcrCreateCluster (781.25s)
              crd_util.go:26: creating couchbase cluster: test-couchbase-f2sv5
              crd_util.go:26: creating couchbase cluster: test-couchbase-vdkwf
              util.go:1304: context deadline exceeded: document count 0, expected 10
              util.go:1305: goroutine 531 [running]:
                  runtime/debug.Stack(0xc000580300, 0xc000b7dc50, 0x1)
                  	/usr/local/go/src/runtime/debug/stack.go:24 +0xab
                  github.com/couchbase/couchbase-operator/test/e2e/e2eutil.Die(0xc000580300, 0x23481e0, 0xc000434a60)
                  	/home/simon/go/src/github.com/couchbase/couchbase-operator/test/e2e/e2eutil/util.go:1305 +0x85
                  github.com/couchbase/couchbase-operator/test/e2e/e2eutil.MustVerifyDocCountInBucket(0xc000580300, 0xc000331680, 0xc000334d80, 0x20392c7, 0x7, 0xa, 0x8bb2c97000)
                  	/home/simon/go/src/github.com/couchbase/couchbase-operator/test/e2e/e2eutil/xdcr_util.go:120 +0xb5
                  github.com/couchbase/couchbase-operator/test/e2e.TestXdcrCreateCluster(0xc000580300)
                  	/home/simon/go/src/github.com/couchbase/couchbase-operator/test/e2e/xdcr_test.go:336 +0x784
                  github.com/couchbase/couchbase-operator/test/e2e/framework.RecoverDecorator.func1(0xc000580300)
                  	/home/simon/go/src/github.com/couchbase/couchbase-operator/test/e2e/framework/test_util.go:347 +0x85
                  testing.tRunner(0xc000580300, 0xc00084e5b0)
                  	/usr/local/go/src/testing/testing.go:909 +0x19a
                  created by testing.(*T).Run
                  	/usr/local/go/src/testing/testing.go:960 +0x652
                  
          test_util.go:35: Suite Test Results: 
          test_util.go:67: 1: TestXdcrCreateCluster...FAIL
          test_util.go:93: Failures: 
          test_util.go:95: 1: TestXdcrCreateCluster
          test_util.go:106: 
               Pass: 0.000000 
               Fail: 1.000000 
               Pass Rate: 0.000000
          test_util.go:117: suite contains failures

      The remote end is using IP based alternate addresses:

      kubectl --context gke_couchbase-engineering_us-west1_spjmurray -n remote exec -ti test-couchbase-vdkwf-0000 -- curl http://localhost:8091/pools/default/nodeServices -u Administrator:password | python3 -m json.tool
      {
          "rev": 39,
          "nodesExt": [
              {
                  "services": {
                      "mgmt": 8091,
                      "mgmtSSL": 18091,
                      "indexAdmin": 9100,
                      "indexScan": 9101,
                      "indexHttp": 9102,
                      "indexStreamInit": 9103,
                      "indexStreamCatchup": 9104,
                      "indexStreamMaint": 9105,
                      "indexHttps": 19102,
                      "kv": 11210,
                      "kvSSL": 11207,
                      "capi": 8092,
                      "capiSSL": 18092,
                      "projector": 9999,
                      "n1ql": 8093,
                      "n1qlSSL": 18093
                  },
                  "thisNode": true,
                  "hostname": "test-couchbase-vdkwf-0000.test-couchbase-vdkwf.remote.svc",
                  "alternateAddresses": {
                      "external": {
                          "hostname": "10.16.0.30",
                          "ports": {
                              "mgmt": 31671,
                              "mgmtSSL": 32548,
                              "kv": 31796,
                              "kvSSL": 31968,
                              "capi": 31383,
                              "capiSSL": 31979
                          }
                      }
                  }
              },
              {
                  "services": {
                      "mgmt": 8091,
                      "mgmtSSL": 18091,
                      "indexAdmin": 9100,
                      "indexScan": 9101,
                      "indexHttp": 9102,
                      "indexStreamInit": 9103,
                      "indexStreamCatchup": 9104,
                      "indexStreamMaint": 9105,
                      "indexHttps": 19102,
                      "kv": 11210,
                      "kvSSL": 11207,
                      "capi": 8092,
                      "capiSSL": 18092,
                      "projector": 9999,
                      "n1ql": 8093,
                      "n1qlSSL": 18093
                  },
                  "hostname": "test-couchbase-vdkwf-0001.test-couchbase-vdkwf.remote.svc",
                  "alternateAddresses": {
                      "external": {
                          "hostname": "10.16.0.34",
                          "ports": {
                              "mgmt": 31615,
                              "mgmtSSL": 31177,
                              "kv": 32342,
                              "kvSSL": 31076,
                              "capi": 32325,
                              "capiSSL": 32739
                          }
                      }
                  }
              },
              {
                  "services": {
                      "mgmt": 8091,
                      "mgmtSSL": 18091,
                      "indexAdmin": 9100,
                      "indexScan": 9101,
                      "indexHttp": 9102,
                      "indexStreamInit": 9103,
                      "indexStreamCatchup": 9104,
                      "indexStreamMaint": 9105,
                      "indexHttps": 19102,
                      "kv": 11210,
                      "kvSSL": 11207,
                      "capi": 8092,
                      "capiSSL": 18092,
                      "projector": 9999,
                      "n1ql": 8093,
                      "n1qlSSL": 18093
                  },
                  "hostname": "test-couchbase-vdkwf-0002.test-couchbase-vdkwf.remote.svc",
                  "alternateAddresses": {
                      "external": {
                          "hostname": "10.16.0.36",
                          "ports": {
                              "mgmt": 31020,
                              "mgmtSSL": 31086,
                              "kv": 31648,
                              "kvSSL": 31784,
                              "capi": 32130,
                              "capiSSL": 31562
                          }
                      }
                  }
              }
          ],
          "clusterCapabilitiesVer": [
              1,
              0
          ],
          "clusterCapabilities": {
              "n1ql": [
                  "enhancedPreparedStatements"
              ]
          }
      }
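
For illustration, the difference between the two address sets in a nodeServices document like the one above can be sketched as follows. This is a minimal sketch, not how XDCR itself resolves addresses: the function name `endpoints` and the trimmed `SAMPLE` payload are hypothetical, and treating a service that is missing from the external port map as "not exposed" is an assumption.

```python
import json

# Trimmed copy of the nodeServices output above (two nodes, fewer services).
SAMPLE = json.loads("""
{
  "nodesExt": [
    {"services": {"mgmt": 8091, "kv": 11210},
     "hostname": "test-couchbase-vdkwf-0000.test-couchbase-vdkwf.remote.svc",
     "alternateAddresses": {"external": {"hostname": "10.16.0.30",
                                         "ports": {"mgmt": 31671, "kv": 31796}}}},
    {"services": {"mgmt": 8091, "kv": 11210},
     "hostname": "test-couchbase-vdkwf-0001.test-couchbase-vdkwf.remote.svc",
     "alternateAddresses": {"external": {"hostname": "10.16.0.34",
                                         "ports": {"mgmt": 31615, "kv": 32342}}}}
  ]
}
""")


def endpoints(node_services, network="default", service="kv"):
    """Return a list of (host, port) pairs for `service` on the chosen network.

    network="default" uses the internal hostname and port; any other value is
    looked up under "alternateAddresses" (for example "external").
    """
    result = []
    for node in node_services["nodesExt"]:
        if network == "default":
            host = node.get("hostname")
            port = node["services"].get(service)
        else:
            alt = node.get("alternateAddresses", {}).get(network)
            if alt is None:
                continue  # node has no address on this network
            host = alt["hostname"]
            # Assumption: a service absent from the port map is not exposed.
            port = alt.get("ports", {}).get(service)
        if host is not None and port is not None:
            result.append((host, port))
    return result


if __name__ == "__main__":
    print(endpoints(SAMPLE, "default", "kv"))
    print(endpoints(SAMPLE, "external", "kv"))
```

With the sample data, the "default" call yields the `*.remote.svc` hostnames on port 11210, while the "external" call yields the 10.16.0.x addresses with their node ports.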
      

      However, the UI indicates that it is attempting to use DNS based addresses.

      Why is this a Problem?

      We strongly discourage the use of IP based alternate addressing, as DNS based addressing is far superior and, thankfully, still works.  The reality, however, is that the vast majority of our customers use Red Hat OpenShift, which uses OVS as its networking layer (i.e. an overlay with DNAT), forcing the use of IP based alternate addressing.

      The big risk here is that anyone doing an upgrade will find their XDCR connections stop working and be unable to roll back.

      Setup

      • X.Y.default.svc are the XDCR source
        • Establishes XDCR using an IP based "node port" URL
      • X.Y.remote.svc are the XDCR target
        • Has IP based alternate addresses exposed
      • Logs coming in a follow up as I don't trust the Mrs' internet connection...

       


          Activity

            simon.murray Simon Murray added a comment -

            Just FWIW I'd have thought it would first ask, can I see normal addresses, if so use them because they are likely to be faster/have no NAT, then ask can I see the alternate ones as a fall back.  Obviously that's not ideal because while the former may work (due to the magic of DNS shadowing) you may actually intend to use the latter.  Being explicit about your choices is a good thing.  From the SDK thing I'm guessing we are dealing with https://console.foo.cloud.couchbase.com:18091?network=external or http://10.9.8.7:31234?network=external which would be totally cool as it's just a simple docs change then.
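
A rough sketch of the selection logic described in this comment, assuming an explicit `?network=` query parameter on the URL takes priority and the "try internal first, fall back to external" heuristic applies otherwise. The function name `choose_network` and the `internal_reachable` callback are hypothetical; this is not the goxdcr or SDK implementation.

```python
from urllib.parse import parse_qs, urlsplit

KNOWN_NETWORKS = ("default", "external")


def choose_network(url, internal_reachable):
    """Pick "default" or "external" for a remote cluster URL.

    An explicit ?network=... value wins. Without one, prefer the internal
    (default) addresses when the caller reports them reachable, and fall
    back to the alternate (external) addresses otherwise.
    """
    query = parse_qs(urlsplit(url).query)
    explicit = query.get("network", [None])[0]
    if explicit is not None:
        if explicit not in KNOWN_NETWORKS:
            raise ValueError(f"unknown network: {explicit!r}")
        return explicit
    # Heuristic path: internal_reachable is a caller-supplied probe.
    return "default" if internal_reachable() else "external"


if __name__ == "__main__":
    # Explicit choice is honoured even if internal addresses happen to resolve.
    print(choose_network("http://10.9.8.7:31234?network=external", lambda: True))
    # No explicit choice: the heuristic decides.
    print(choose_network("http://10.9.8.7:31234", lambda: False))
```

The first call shows why being explicit matters: with DNS shadowing the internal addresses may appear reachable, yet the caller still gets the external network they asked for.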

            neil.huang Neil Huang added a comment - - edited

            Thanks Simon Murray for the explanation.

            Add an explicit "network" specification to the cluster reference (or replication spec) as is done in the SDKs so that the attempt to guess the user's intent is removed. Default would be "auto" or try to guess intent; but we'd allow explicit specification. Is an API change of course - so seems like 6.6 would be the earliest.

            XDCR will go with this approach. This will ensure going forward that both Operator and Cloud DBaaS can route to the right address. The expectation is that Operator and Cloud will be the only users, so the plan at the moment is to introduce it as a hidden flag as part of remote cluster creation.

            neil.huang Neil Huang added a comment -

            A new optional REST flag is added, with the key of "network_type".

            Valid values are:

            • "external" - Force XDCR to use alternate addresses whenever possible (6.5.0 behavior). This will be used by the Kubernetes Operator and/or DBaaS.
            • "default" - Force XDCR to use internal (default) addresses when communicating with the remote cluster. This is similar to an option that the SDKs provide.

            An example of it would be:

             curl -X POST -u Administrator:wewewe http://127.0.0.1:9000/pools/default/remoteClusters -d name=self -d hostname=127.0.0.1:9001 -d username=Administrator -d password=wewewe -d network_type=external
            

            The flag shares its name with the one present in gocb. The difference is that gocb allows the parameter to be given as part of the connection string, whereas XDCR requires it to be passed as a REST parameter. This is because XDCR's hostname field behaves differently from an SDK connection string.
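
As a rough illustration of the same call from code, the sketch below builds (but does not send) the equivalent POST with Python's standard library. The helper name `remote_cluster_request` is hypothetical, the client-side check only mirrors the two values listed above, and HTTP basic authentication (the `-u` option in the curl example) is left out for brevity.

```python
from urllib.parse import urlencode
from urllib.request import Request

VALID_NETWORK_TYPES = ("default", "external")


def remote_cluster_request(base_url, name, hostname, username, password,
                           network_type=None):
    """Build a POST request that creates an XDCR remote cluster reference.

    network_type is optional; when omitted, the field is not sent at all.
    """
    fields = {
        "name": name,
        "hostname": hostname,
        "username": username,
        "password": password,
    }
    if network_type is not None:
        if network_type not in VALID_NETWORK_TYPES:
            # Reject early instead of waiting for the server to refuse it.
            raise ValueError("network_type specified is invalid")
        fields["network_type"] = network_type
    return Request(
        base_url + "/pools/default/remoteClusters",
        data=urlencode(fields).encode("ascii"),  # form-encoded body, like curl -d
        method="POST",
    )


if __name__ == "__main__":
    req = remote_cluster_request(
        "http://127.0.0.1:9000", "self", "127.0.0.1:9001",
        "Administrator", "wewewe", network_type="external",
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("ascii"))
```

Passing the returned object to `urllib.request.urlopen` would send it; credentials would need to be added first.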


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-2228 contains goxdcr commit c2b8b9a with commit message:
            MB-38995 - Added a network_mode flag similar to gocb to ensure users can skip heuristics to use external or default

            arunkumar Arunkumar Senthilnathan added a comment -

            Verified by regression in 7.0.0-4617:

            14:29:32 --- PASS: TestOperator/TestXDCRCreateCluster (238.12s)
            14:29:32 crd_util.go:34: creating couchbase cluster: test-couchbase-whbwt
            14:29:32 crd_util.go:34: creating couchbase cluster: test-couchbase-h4rpf
            14:29:32 xdcr_util.go:122: inserted 100 documents in 8.17438137s

            Manually validated the new option added:

            [root@node1-cb700-beta-centos7 ~]# curl -X POST -u Administrator:password http://10.112.210.101:8091/pools/default/remoteClusters -d name=self -d hostname=10.112.210.102:8091 -d username=Administrator -d password=password -d network_type=default
            {"deleted":false,"hostname":"10.112.210.102:8091","name":"self","secureType":"none","uri":"/pools/default/remoteClusters/self","username":"Administrator","uuid":"f107324c2b6649a4ba0f3d235c85f666","validateURI":"/pools/default/remoteClusters/self?just_validate=1"}
            [root@node1-cb700-beta-centos7 ~]# curl -X POST -u Administrator:password http://10.112.210.101:8091/pools/default/remoteClusters -d name=self -d hostname=10.112.210.102:8091 -d username=Administrator -d password=password -d network_type=external
            {"deleted":false,"hostname":"10.112.210.102:8091","name":"self","secureType":"none","uri":"/pools/default/remoteClusters/self","username":"Administrator","uuid":"f107324c2b6649a4ba0f3d235c85f666","validateURI":"/pools/default/remoteClusters/self?just_validate=1"}
            [root@node1-cb700-beta-centos7 ~]# curl -X POST -u Administrator:password http://10.112.210.101:8091/pools/default/remoteClusters -d name=self -d hostname=10.112.210.102:8091 -d username=Administrator -d password=password -d network_type=rgqerg
            {"network_type":"network_type specified is invalid"}
            [root@node1-cb700-beta-centos7 ~]# curl -X POST -u Administrator:password http://10.112.210.101:8091/pools/default/remoteClusters -d name=self -d hostname=10.112.210.102:8091 -d username=Administrator -d password=password -d network_type=1
            {"network_type":"network_type specified is invalid"}

            People

              arunkumar Arunkumar Senthilnathan
              simon.murray Simon Murray