Couchbase Server / MB-35098

CBAS Rebalance Getting 'Stuck'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: 6.0.3
    • Affects Version: 6.0.1
    • Component: analytics
    • Environment: GKE, Operator 2.0.0
    • Sprint: CX Sprint 159, CX Sprint 160, CX Sprint 161, CX Sprint 162, CX Sprint 163, CX Sprint 164, CX Sprint 165, CX Sprint 166

    Description

      Test scenario:

      • Three node cluster created.
      • 3 data sets created, one encompassing the whole bucket, one for document IDs matching anything with a 1 in it, another for the inverse.
      • Load generated with cbc_pillowfight.
      • Pods killed one after the other allowing for the operator to repair the cluster.
      • Rebalance appears to fail on the first attempt (internally we wait for the task to complete, then poll the cluster status a few times to see whether NS server still requires a rebalance), then succeeds on a retry (possibly related to MB-34928, but we still expect the rebalance to succeed).
      • On the final kill, the rebalance appears to stall.
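The wait-then-poll check described in the rebalance step can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the callables `rebalance_running` and `needs_rebalance`, the retry count, and the intervals are all hypothetical stand-ins for whatever the operator queries from NS server.

```python
import time
from typing import Callable


def rebalance_succeeded(
    rebalance_running: Callable[[], bool],
    needs_rebalance: Callable[[], bool],
    status_checks: int = 3,
    interval_s: float = 1.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Wait for the rebalance task to finish, then confirm the cluster is balanced.

    Both callables are hypothetical stand-ins for cluster status queries.
    Raises TimeoutError if the task is still running after timeout_s.
    """
    waited = 0.0
    # Phase 1: wait for the rebalance task itself to complete.
    while rebalance_running():
        if waited >= timeout_s:
            raise TimeoutError("rebalance task did not complete")
        sleep(interval_s)
        waited += interval_s

    # Phase 2: poll the cluster status a few times and report success as soon
    # as the cluster no longer asks for a rebalance.
    for attempt in range(status_checks):
        if not needs_rebalance():
            return True
        if attempt < status_checks - 1:
            sleep(interval_s)
    return False
```

Under this sketch, a `False` result would correspond to an "incomplete rebalance" report (the task ended but the cluster still wants a rebalance), while a `TimeoutError` would correspond to a task that never finishes.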

      Cluster name: test-couchbase-q7pr9

      In the operator logs (cbopinfo-20190715T161808+0100/default/deployment/couchbase-operator/couchbase-operator.log) we see (again possibly related to MB-34928):

      {"level":"info","ts":1563203076.304325,"logger":"cluster","msg":"Pods failed over","cluster":"test-couchbase-q7pr9"}
      {"level":"info","ts":1563203076.3044674,"logger":"cluster","msg":"Pod unrecoverable","cluster":"test-couchbase-q7pr9","name":"test-couchbase-q7pr9-0000","reason":"No volume mounts defined"}
      {"level":"info","ts":1563203076.3044827,"logger":"cluster","msg":"Pod failed, deleting","cluster":"test-couchbase-q7pr9","name":"test-couchbase-q7pr9-0000"}
      {"level":"info","ts":1563203078.3110914,"logger":"cluster","msg":"Creating pod","cluster":"test-couchbase-q7pr9","name":"test-couchbase-q7pr9-0003","image":"couchbase/server:enterprise-6.0.1"}
      {"level":"info","ts":1563203100.3373904,"logger":"cluster","msg":"Pod added to cluster","cluster":"test-couchbase-q7pr9","name":"test-couchbase-q7pr9-0003"}
      {"level":"info","ts":1563203100.4955597,"logger":"cluster","msg":"External address collection failed","cluster":"test-couchbase-q7pr9","name":"test-couchbase-q7pr9-0000"}
      {"level":"info","ts":1563203101.137671,"logger":"couchbaseutil","msg":"Rebalancing","progress":0}
      {"level":"info","ts":1563203105.1573079,"logger":"couchbaseutil","msg":"Rebalancing","progress":2.978124323153564}
      {"level":"info","ts":1563203109.1806362,"logger":"couchbaseutil","msg":"Rebalancing","progress":10.5154862464804}
      {"level":"info","ts":1563203113.1996868,"logger":"couchbaseutil","msg":"Rebalancing","progress":17.98787091184752}
      {"level":"info","ts":1563203117.224926,"logger":"couchbaseutil","msg":"Rebalancing","progress":25.43859649122807}
      {"level":"info","ts":1563203121.2435634,"logger":"couchbaseutil","msg":"Rebalancing","progress":32.68356075373619}
      {"level":"debug","ts":1563203135.308571,"logger":"cluster","msg":"Reconciliation completed","cluster":"test-couchbase-q7pr9"}
      {"level":"error","ts":1563203135.308714,"logger":"cluster","msg":"Reconciliation failed","cluster":"test-couchbase-q7pr9","error":"failed to rebalance: cluster reports rebalance incomplete","stacktrace":"github.com/couchbase/couchbase-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/home/simon/go/src/github.com/couchbase/couchbase-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\t/home/simon/go/src/github.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:382\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\t/home/simon/go/src/github.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:399\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\t/home/simon/go/src/github.com/couchbase/couchbase-operator/pkg/controller/controller.go:86\ngithub.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/simon/go/src/github.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215\ngithub.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/home/simon/go/src/github.com/couchbase/couchbase-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/home/simon/go/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/simon/go/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/home/simon/go/src/github.com/couchbase/couchbase-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
      {"level":"debug","ts":1563203135.3773098,"logger":"cluster","msg":"Reconciliation starting","cluster":"test-couchbase-q7pr9"}

      But it does eventually fix itself. Note how long this rebalance takes.
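To put a number on the rebalance duration, the `ts` fields (Unix epoch seconds) of the "Rebalancing" records can be compared directly. The helper below is an illustrative sketch, not part of the operator; the name `rebalance_window` is made up for this example.

```python
import json
from datetime import datetime, timezone


def rebalance_window(log_lines):
    """Return (start_utc, elapsed_seconds, last_progress) for 'Rebalancing' records.

    Returns None when the excerpt contains no such records.
    """
    records = []
    for line in log_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank, wrapped, or truncated lines
        if rec.get("msg") == "Rebalancing":
            records.append(rec)
    if not records:
        return None
    first, last = records[0], records[-1]
    start = datetime.fromtimestamp(first["ts"], tz=timezone.utc)
    return start, last["ts"] - first["ts"], last["progress"]
```

Applied to the excerpt above, the first and last "Rebalancing" samples are about 20 seconds apart, and the last reported progress is roughly 32.7% before reconciliation is reported as failed.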

      However, on the final pod kill we get this:

      {"level":"info","ts":1563203336.4409792,"logger":"cluster","msg":"Pods failed over","cluster":"test-couchbase-q7pr9"}
      {"level":"info","ts":1563203336.4410956,"logger":"cluster","msg":"Pod unrecoverable","cluster":"test-couchbase-q7pr9","name":"test-couchbase-q7pr9-0002","reason":"No volume mounts defined"}
      {"level":"info","ts":1563203336.44113,"logger":"cluster","msg":"Pod failed, deleting","cluster":"test-couchbase-q7pr9","name":"test-couchbase-q7pr9-0002"}
      {"level":"info","ts":1563203338.4788873,"logger":"cluster","msg":"Creating pod","cluster":"test-couchbase-q7pr9","name":"test-couchbase-q7pr9-0005","image":"couchbase/server:enterprise-6.0.1"}
      {"level":"info","ts":1563203360.2708702,"logger":"cluster","msg":"Pod added to cluster","cluster":"test-couchbase-q7pr9","name":"test-couchbase-q7pr9-0005"}
      {"level":"info","ts":1563203360.4640975,"logger":"cluster","msg":"External address collection failed","cluster":"test-couchbase-q7pr9","name":"test-couchbase-q7pr9-0002"}
      {"level":"info","ts":1563203361.0830722,"logger":"couchbaseutil","msg":"Rebalancing","progress":0}
      {"level":"info","ts":1563203365.1335223,"logger":"couchbaseutil","msg":"Rebalancing","progress":3.869958952351842}
      {"level":"info","ts":1563203369.1519842,"logger":"couchbaseutil","msg":"Rebalancing","progress":11.06540621770822}
      {"level":"info","ts":1563203373.163263,"logger":"couchbaseutil","msg":"Rebalancing","progress":18.12210904723971}
      {"level":"info","ts":1563203377.1741796,"logger":"couchbaseutil","msg":"Rebalancing","progress":25.01671994009732}
      {"level":"info","ts":1563203381.186079,"logger":"couchbaseutil","msg":"Rebalancing","progress":31.75671925785666}
      {"level":"info","ts":1563203385.1933627,"logger":"couchbaseutil","msg":"Rebalancing","progress":66.66666766666667}
      {"level":"info","ts":1563203389.2043526,"logger":"couchbaseutil","msg":"Rebalancing","progress":66.666669}
      {"level":"info","ts":1563203393.2122645,"logger":"couchbaseutil","msg":"Rebalancing","progress":66.66667033333334}
      {"level":"info","ts":1563203397.2163284,"logger":"couchbaseutil","msg":"Rebalancing","progress":66.66667166666666}
      {"level":"info","ts":1563203401.2228284,"logger":"couchbaseutil","msg":"Rebalancing","progress":66.666673}
      {"level":"info","ts":1563203405.233028,"logger":"couchbaseutil","msg":"Rebalancing","progress":66.66667433333333}
      {"level":"info","ts":1563203409.242073,"logger":"couchbaseutil","msg":"Rebalancing","progress":66.66667566666668}
      {"level":"info","ts":1563203413.2541904,"logger":"couchbaseutil","msg":"Rebalancing","progress":66.66667699999999}

      Then we get a timeout and fail the test.
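A stall like this can be flagged from the log itself rather than waiting for the overall timeout. Note that the reported progress is not strictly constant: it creeps up by roughly 1.3e-6 per sample, so a plain equality check between consecutive samples would not catch it. The sketch below uses a small tolerance instead; `find_stall`, `min_delta`, and `min_samples` are hypothetical names and thresholds chosen for illustration.

```python
import json


def find_stall(log_lines, min_delta=0.01, min_samples=3):
    """Return the ts of the first sample of a progress plateau, or None.

    A plateau is min_samples consecutive 'Rebalancing' records whose progress
    differs from the previous record by less than min_delta percentage points.
    """
    prev_progress = None
    prev_ts = None
    plateau_start = None
    flat = 0
    for line in log_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank, wrapped, or truncated lines
        if rec.get("msg") != "Rebalancing":
            continue
        progress, ts = rec["progress"], rec["ts"]
        if prev_progress is not None and abs(progress - prev_progress) < min_delta:
            if flat == 0:
                plateau_start = prev_ts  # plateau begins at the previous sample
            flat += 1
            if flat >= min_samples:
                return plateau_start
        else:
            flat = 0
            plateau_start = None
        prev_progress, prev_ts = progress, ts
    return None
```

With the default thresholds, the excerpt above is flagged as stalled from the first 66.67% sample onward, whereas the earlier rebalance, which keeps advancing by several percentage points per sample, is not.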

            People

              simon.murray Simon Murray