Uploaded image for project: 'Couchbase Kubernetes'
  1. Couchbase Kubernetes
  2. K8S-2679

Operator does not trigger Rebalance to remove foreign node.

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 2.3.0
    • 2.3.0
    • operator
    • GKE K8S v1.21
    • 11: Blogs, Training, leftovers
    • 1

    Description

      Test Case: TestRemoveForeignNode (https://github.com/couchbase/couchbase-operator/blob/87ac6490bda7c355a47c70340e694d8a29f7be24/test/e2e/node_test.go#L310 )

      Test Case Summary: CB node added to the cluster while Operator is in the paused state should be removed once Operator is unpaused. Data loading in the default bucket is also in progress. 

      Couchbase Server Version: 7.0.3

      Operator, DAC, Certification: latest builds

      Error

      === CONT  TestOperator/TestRemoveForeignNode
          util.go:1390: context deadline exceeded: failed to wait for event RebalanceStarted/A rebalance has been started to balance data across the cluster
          util.go:1391: goroutine 519 [running]:
              runtime/debug.Stack()
              	runtime/debug/stack.go:24 +0x72
              github.com/couchbase/couchbase-operator/test/e2e/e2eutil.Die(0xc000726340, {0x47b9820, 0xc00093e000})
              	github.com/couchbase/couchbase-operator/test/e2e/e2eutil/util.go:1386 +0x36
              github.com/couchbase/couchbase-operator/test/e2e/e2eutil.mustWaitForResourceEventFromNow(0xc00000e108, 0x3b9313d, {0x47d7708, 0xc000124b00}, 0xc00000e108, 0x4)
              	github.com/couchbase/couchbase-operator/test/e2e/e2eutil/wait_util.go:315 +0x105
              github.com/couchbase/couchbase-operator/test/e2e/e2eutil.MustWaitForClusterEvent(...)
              	github.com/couchbase/couchbase-operator/test/e2e/e2eutil/wait_util.go:599
              github.com/couchbase/couchbase-operator/test/e2e.TestRemoveForeignNode(0x0)
              	github.com/couchbase/couchbase-operator/test/e2e/node_test.go:338 +0xd50
              testing.tRunner(0xc000726340, 0x3c84848)
              	testing/testing.go:1259 +0x230
              created by testing.(*T).Run
              	testing/testing.go:1306 +0x727
               

      Looking at the Operator logs, it was paused successfully and we see the corresponding messages: 

      {"level":"info","ts":1647511504.5473213,"logger":"cluster","msg":"Operator paused, skipping","cluster":"test-kztwf/test-couchbase-gcwvz"}
      {"level":"info","ts":1647511504.553323,"logger":"cluster","msg":"Resource updated","cluster":"test-kztwf/test-couchbase-gcwvz","diff":"  (\n  \t\"\"\"\n  \t... // 28 identical lines\n  \t  status: \"True\"\n  \t  type: Balanced\n+ \tcontrolPaused: true\n  \tcurrentVersion: 7.0.3\n  \tmembers:\n  \t... // 4 identical lines\n  \t\"\"\"\n  )\n"}
      {"level":"info","ts":1647511504.5906258,"logger":"cluster","msg":"Operator paused, skipping","cluster":"test-kztwf/test-couchbase-gcwvz"}
      {"level":"info","ts":1647511514.5468116,"logger":"cluster","msg":"Operator paused, skipping","cluster":"test-kztwf/test-couchbase-gcwvz"}
      {"level":"info","ts":1647511524.5536973,"logger":"cluster","msg":"Operator paused, skipping","cluster":"test-kztwf/test-couchbase-gcwvz"}
      {"level":"info","ts":1647511534.5658717,"logger":"cluster","msg":"Operator paused, skipping","cluster":"test-kztwf/test-couchbase-gcwvz"}
      {"level":"info","ts":1647511544.5741305,"logger":"cluster","msg":"Operator paused, skipping","cluster":"test-kztwf/test-couchbase-gcwvz"}
      

      The last line of the stream represents the last timestamp when the operator was in the paused state.

      Now looking at the couchbase server logs for the pod created by the operator, we see:

      ns_server.info.log - Rebalance Starts to add foreign node: 

      [user:info,2022-03-17T10:05:17.020Z,ns_1@test-couchbase-gcwvz-0000.test-couchbase-gcwvz.test-kztwf.svc:<0.945.0>:ns_orchestrator:idle:772]Starting rebalance, KeepNodes = ['ns_1@test-couchbase-gcwvz-0000.test-couchbase-gcwvz.test-kztwf.svc',
                                       'ns_1@test-couchbase-gcwvz-hrisovalantis.test-couchbase-gcwvz.test-kztwf.svc'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 540947809d921d233645ae1f740e9fa5
      [rebalance:info,2022-03-17T10:05:17.021Z,ns_1@test-couchbase-gcwvz-0000.test-couchbase-gcwvz.test-kztwf.svc:<0.4095.0>:ns_rebalancer:drop_old_2i_indexes:1274]Going to drop possible old 2i indexes on nodes ['ns_1@test-couchbase-gcwvz-hrisovalantis.test-couchbase-gcwvz.test-kztwf.svc'] 

      ns_server.info.log - Rebalance completes: 

      [user:info,2022-03-17T10:05:45.750Z,ns_1@test-couchbase-gcwvz-0000.test-couchbase-gcwvz.test-kztwf.svc:<0.945.0>:ns_orchestrator:log_rebalance_completion:1412]Rebalance completed successfully.
      Rebalance Operation Id = 540947809d921d233645ae1f740e9fa5 

      So, Rebalance to add the foreign node(test-couchbase-gcwvz-hrisovalantis.test-couchbase-gcwvz.test-kztwf.svc)starts and completes between [10:05:17.020 , 10:05:45.750]

      This matches our Operator paused time period: [starting ts 1647511504.5473213 UTC time: 10:05:04.547 and remained in the paused state till 10:05:46.276] confirming Operator did not intervene in the rebalance process.

      Operator is then unpaused. This should trigger Rebalance out operation for the foreign node. 

      However, we do not see any such requests made in the ns_server logs of the orchestrator pod apart from the one rebalance request made to add the foreign node.

      The only info we see in the ns_server logs post that are details about bucket compaction since data loading is in progress:

      [ns_server:info,2022-03-17T10:15:45.125Z,ns_1@test-couchbase-gcwvz-0000.test-couchbase-gcwvz.test-kztwf.svc:<0.2269.3>:compaction_daemon:maybe_compact_vbucket:740]Compacting 'default/1016', DataSize = 14497, FileSize = 139363, Options = {1647252345,
                                                                                 0,
                                                                                 false}
      [ns_server:info,2022-03-17T10:15:45.153Z,ns_1@test-couchbase-gcwvz-0000.test-couchbase-gcwvz.test-kztwf.svc:<0.2269.3>:compaction_daemon:maybe_compact_vbucket:743]Compaction of <<"default/1016">> has finished with ok
       

      The Operator logs had these messages for the foreign node after it was unpaused: 

      {"level":"info","ts":1647511546.2761085,"logger":"cluster","msg":"Resource updated","cluster":"test-kztwf/test-couchbase-gcwvz","diff":"  (\n  \t\"\"\"\n  \t... // 36 identical lines\n  \t  waitForAddressReachable: 10m0s\n  \t  waitForAddressReachableDelay: 2m0s\n- \tpaused: true\n  \tsecurity:\n  \t  adminSecret: basic-test-secret\n  \t... // 23 identical lines\n  \t\"\"\"\n  )\n"}
      {"level":"info","ts":1647511546.276161,"logger":"cluster","msg":"Pod ignored, no owner","cluster":"test-kztwf/test-couchbase-gcwvz","name":"test-couchbase-gcwvz-hrisovalantis"}
      {"level":"info","ts":1647511546.2761724,"logger":"cluster","msg":"Pod ignored, no owner","cluster":"test-kztwf/test-couchbase-gcwvz","name":"test-couchbase-gcwvz-hrisovalantis"}
      {"level":"debug","ts":1647511546.281521,"logger":"api","msg":"http","cluster":"test-kztwf/test-couchbase-gcwvz","method":"GET","url":"http://test-couchbase-gcwvz-0000.test-couchbase-gcwvz.test-kztwf.svc:8091/pools/default","status":"200 OK","time_ms":5.272114}
      {" 

      Hence, the assumption that the Operator did not send rebalance initiation request to ns_server.

      The test gives out the same test error when running against CB Server v6.6.5 as well.
      Logs attached.

      Attachments

        For Gerrit Dashboard: K8S-2679
        # Subject Branch Project Status CR V

        Activity

          People

            simon.murray Simon Murray
            prateek.kumar Prateek Kumar (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty