  1. Couchbase Kubernetes
  2. K8S-2194

Document XDCR Doesn't Work on 6.5.1+

Details

    • Page
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • 2.2.0
    • 2.2.0
    • documentation
    • None
    • 22: Docs/Cleanup/CI
    • 1

    Description

      XDCR tests fail on Couchbase Server 6.5.2 with the 2.2 RC1 build, specifically when checking for the expected number of documents in the remote cluster's bucket.

      The function used to populate buckets is https://github.com/couchbase/couchbase-operator/blob/master/test/e2e/xdcr_test.go#L589 

      The replication resource that is created fails to find the remote host.

      Job: http://qa.sc.couchbase.com/view/Cloud/job/k8s-cbop-gke-pipeline/189/consoleFull

      TestCase: TestXDCRTargetNodeServiceDelete

      Error:

      17:16:57     util.go:1288: timeout: document count 0, expected 100
      17:16:57     util.go:1289: goroutine 1060 [running]:
      17:16:57         runtime/debug.Stack(0x1f14859, 0xc000eda520, 0x296bc20)
      17:16:57         	/jenkins/workspace/k8s-cbop-gke-pipeline/go/src/runtime/debug/stack.go:24 +0xab
      17:16:57         github.com/couchbase/couchbase-operator/test/e2e/e2eutil.Die(0xc0005d6a80, 0x296bc20, 0xc000eda520)
      17:16:57         	/jenkins/workspace/k8s-cbop-gke-pipeline/test/e2e/e2eutil/util.go:1284 +0x34
      17:16:57         github.com/couchbase/couchbase-operator/test/e2e/e2eutil.MustVerifyDocCountInBucket(0xc0005d6a80, 0xc000590680, 0xc002402500, 0x261e31a, 0x7, 0x64, 0x8bb2c97000)
      17:16:57         	/jenkins/workspace/k8s-cbop-gke-pipeline/test/e2e/e2eutil/xdcr_util.go:166 +0xb9
      17:16:57         github.com/couchbase/couchbase-operator/test/e2e.TestXDCRTargetNodeServiceDelete(0xc0005d6a80)
      17:16:57         	/jenkins/workspace/k8s-cbop-gke-pipeline/test/e2e/xdcr_test.go:590 +0x12c7
      17:16:57         testing.tRunner(0xc0005d6a80, 0x26dde50)
      17:16:57         	/jenkins/workspace/k8s-cbop-gke-pipeline/go/src/testing/testing.go:1193 +0x203
      17:16:57         created by testing.(*T).Run
      17:16:57         	/jenkins/workspace/k8s-cbop-gke-pipeline/go/src/testing/testing.go:1238 +0x5d8
      17:16:57         
      17:16:57 time="2021-05-17T17:16:49-07:00" level=info msg="TestOperator/TestXDCRTargetNodeServiceDelete ✗" 
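The failure above is a poll-until-timeout check on the bucket's document count. As a minimal sketch of that pattern (not the actual `MustVerifyDocCountInBucket` implementation, which is not shown in this ticket), the following uses a hypothetical `countFn` callback in place of whatever the real helper uses to query the bucket:

```go
package main

import (
	"fmt"
	"time"
)

// waitForDocCount polls countFn until it reports the expected number of
// documents or the timeout elapses. countFn is a hypothetical stand-in for
// the real bucket query; it is an assumption, not the operator's API.
func waitForDocCount(countFn func() (int, error), expected int, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	last := 0
	for {
		n, err := countFn()
		if err == nil {
			last = n
			if n == expected {
				return nil
			}
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timeout: document count %d, expected %d", last, expected)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulate a replication that never delivers any documents.
	stuck := func() (int, error) { return 0, nil }
	err := waitForDocCount(stuck, 100, 50*time.Millisecond, 10*time.Millisecond)
	fmt.Println(err)
}
```

With a replication that never connects to the remote host, the count stays at 0 and the check can only end in the timeout error.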

      (cboopinfo attached)

      (Server Logs attached)

        Activity

          simon.murray Simon Murray added a comment -

          Logs reveal the Operator is doing what it should. You may need to check the server UI/logs.

          simon.murray Simon Murray added a comment -

          So what it looks like, thanks for the attachments, is that even though the operator is using node port networking, the XDCR source is still trying to use DNS to connect, which is wrong. We've given it a node port, XDCR has obviously tried to bootstrap using the /pools/default endpoint, and has decided to use DNS anyway and not the alternative addresses. Smells like a server bug if anything. I'll have a look tomorrow.

          simon.murray Simon Murray added a comment -

          Odd, with

          test-couchbase -test TestXDCRTargetNodeServiceDelete -server-image couchbase/server:6.5.2
          

          I get...

          INFO[0106] Test Summary                         
          INFO[0106]    1: TestXDCRTargetNodeServiceDelete ✔ 
          INFO[0106] Suite Summary (custom)               
          INFO[0106]  ✔ Passes: 1 (100.00%) 
          

          simon.murray Simon Murray added a comment -

          My image is...

          [Tue 25 May 10:08:10 BST 2021] simon@symphony ~/src/github.com/couchbase/couchbase-operator docker pull couchbase/server:6.5.2
          6.5.2: Pulling from couchbase/server
          Digest: sha256:23971267523fedc95c7c2973d8afd4715353a0749d228a75aa80e53c2cea978e
          Status: Image is up to date for couchbase/server:6.5.2
          docker.io/couchbase/server:6.5.2
          

          Looking at the nodes I can see this is the image in use...

          couchbase/server@sha256:4b68f699849705e06c0fdce42ca5ba752268e546bd7fc75b0373cd3c4b8671bb
          

          Which is somewhat different from what's cached on your nodes.

          simon.murray Simon Murray added a comment -

          Okay, so running this on a real setup, not just Kind, I do get the same error...

          Now given the test has

          framework.Requires(t, k8s1).CouchbaseBucket().NotVersion("6.5.1")
          

          It stands to reason this one is also broken. Perhaps it (and all the other ones) need to be...

          framework.Requires(t, k8s1).CouchbaseBucket().NotVersion("6.5.1").NotVersion("6.5.2")
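
The idea of excluding several server versions from a test can be sketched independently of the test framework. The helpers below (`imageVersion`, `shouldSkip`) are hypothetical names for illustration only; they are not part of `framework.Requires` and make no claim about how `NotVersion` is implemented:

```go
package main

import (
	"fmt"
	"strings"
)

// imageVersion extracts the tag from an image reference such as
// "couchbase/server:6.5.2". Hypothetical helper: digest references
// ("image@sha256:...") and untagged images return "".
func imageVersion(image string) string {
	if strings.Contains(image, "@") {
		return ""
	}
	i := strings.LastIndex(image, ":")
	// A ':' followed by a '/' belongs to a registry port, not a tag.
	if i < 0 || strings.Contains(image[i:], "/") {
		return ""
	}
	return image[i+1:]
}

// shouldSkip reports whether the image's version is one of the excluded ones.
func shouldSkip(image string, excluded ...string) bool {
	v := imageVersion(image)
	if v == "" {
		return false
	}
	for _, e := range excluded {
		if v == e {
			return true
		}
	}
	return false
}

func main() {
	image := "couchbase/server:6.5.2"
	fmt.Println(shouldSkip(image, "6.5.1"))          // only 6.5.1 excluded
	fmt.Println(shouldSkip(image, "6.5.1", "6.5.2")) // both excluded
}
```

Excluding only 6.5.1 lets a 6.5.2 run through to the failure; listing both versions skips it.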
          

          simon.murray Simon Murray added a comment -

          I think I remember now: it broke in 6.5.1 because XDCR tries to match the connection address (e.g. a node port) against an alternative address. If the match fails, it falls back to using the normal addresses, which is exactly what we are seeing here. The ability to specify which network to use only came in 6.6.
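
The fallback described here can be illustrated with a small sketch. This is an assumption-laden model of the behaviour as described in this comment, not the server's actual XDCR code; the `node` type, `chooseAddresses`, and all addresses are hypothetical:

```go
package main

import "fmt"

// node is a simplified, hypothetical view of one entry in the remote
// cluster's node list.
type node struct {
	internal string // default-network address, e.g. a pod DNS name
	external string // alternate address, e.g. "nodeIP:nodePort"; empty if none
}

// chooseAddresses models the described behaviour: if the address used to
// connect matches some node's alternate address, alternate addresses are
// used; otherwise everything falls back to the default addresses.
func chooseAddresses(connAddr string, nodes []node) []string {
	useExternal := false
	for _, n := range nodes {
		if n.external != "" && n.external == connAddr {
			useExternal = true
			break
		}
	}
	out := make([]string, 0, len(nodes))
	for _, n := range nodes {
		if useExternal && n.external != "" {
			out = append(out, n.external)
		} else {
			out = append(out, n.internal)
		}
	}
	return out
}

func main() {
	nodes := []node{
		{internal: "cb-0000.cb.default.svc:8091", external: "10.0.0.5:31001"},
		{internal: "cb-0001.cb.default.svc:8091", external: "10.0.0.6:31002"},
	}
	// Connection address matches an alternate address: external network is used.
	fmt.Println(chooseAddresses("10.0.0.5:31001", nodes))
	// No match (e.g. a different node port): falls back to DNS names.
	fmt.Println(chooseAddresses("10.0.0.9:30000", nodes))
}
```

In the second call the connection address matches no alternate address, so the source is handed DNS names it cannot resolve from outside the cluster, which is consistent with the replication failing to find the remote host.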

          simon.murray Simon Murray added a comment -

          I'll take ownership...


          People

            simon.murray Simon Murray
            prateek.kumar Prateek Kumar