Couchbase Server / MB-37180

Need a Better Rebalance Interface


Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • None
    • 6.5.0
    • ns_server
    • CAO 2.0.0

    Description

      Scenario:

      When the Operator scales down, we trigger a rebalance and then wait for it to complete before deleting the pods, for good reason: we want to avoid data loss.  Initially we called the rebalance API, then polled the rebalance task.

      With 6.5.0-beta2, the task sometimes took a while to appear, so we started polling for the task to appear first in order to handle this behavioral change (yes, this is correct behavior - the prior solution was racy!).  Without this, the Operator would have declared "no rebalance task, must be done" and then prematurely deleted the pods...

      Moving on to 6.5.0-rc1, the behavior changed again: a rebalance can now run to completion before we even start polling for the task (performance improvement - nice!).  In this case we throw an error because we never observed the rebalance.  That's fine in that we don't do anything stupid and things still work (although users will quite rightly complain about errors in the logs), but all of QE is predicated upon the determinism and reliability of events.  QE needs to see a rebalance start and complete successfully, with the Operator providing these synchronization events.
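
      The race described above can be illustrated with a small simulation. This is only a sketch: `FakeCluster`, `RebalanceNotObserved` and `wait_for_rebalance` are hypothetical names invented for illustration, not the Operator's code or the ns_server API, and a real loop would sleep between polls.

```python
class FakeCluster:
    """Hypothetical stand-in for the cluster's task list (not the real API)."""

    def __init__(self, visible_polls):
        # Scripted answers: for each successive poll, is a rebalance task listed?
        self._polls = iter(visible_polls)

    def rebalance_task_visible(self):
        # Once the scripted answers run out, no rebalance task is listed.
        return next(self._polls, False)


class RebalanceNotObserved(Exception):
    """Raised when the task never shows up within the polling budget."""


def wait_for_rebalance(cluster, appear_attempts=3):
    """Two-phase wait: see the task appear, then wait for it to go away.

    Returns the number of polls during which the task was still running.
    """
    for _ in range(appear_attempts):
        if cluster.rebalance_task_visible():
            break
    else:
        # "Not started yet" and "already finished" look identical here.
        raise RebalanceNotObserved("rebalance task never appeared")

    running_polls = 0
    while cluster.rebalance_task_visible():
        running_polls += 1
    return running_polls


# beta2-like timing: the task appears late, then finishes.
slow = FakeCluster([False, True, True, False])
print(wait_for_rebalance(slow))

# rc1-like timing: the rebalance finished before the first poll.
try:
    wait_for_rebalance(FakeCluster([]))
except RebalanceNotObserved as err:
    print("error:", err)
```

      The second case is the problem: the rebalance succeeded, but the only thing the poller can report is that it never saw one.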

      Possible fix:

      When we trigger a rebalance, the server accepts the request and returns an ID (or a full HTTP path to poll - MattI likes this, but in all honesty it's better for our architecture to just have an ID).  A polling path should then tell us whether that specific rebalance has even been registered yet (e.g. started), is in progress (with a percentage), or has completed or failed (some kind of error string would be great for our logs/supportability/testing).  The endpoint need only remain valid for a few moments beyond the end of the rebalance, after which it can be garbage collected.  The key thing here is that when we trigger a rebalance, we need to be able to acknowledge that it has happened.
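
      A minimal in-memory sketch of how such an interface might look from the caller's side. Everything here is an assumption made for illustration: the state names, the `Status` fields, the ID format and `FakeRebalanceService` itself are hypothetical and do not describe any existing ns_server endpoint.

```python
import enum
import itertools
from dataclasses import dataclass
from typing import Optional


class State(enum.Enum):
    NOT_REGISTERED = "notRegistered"  # ID not (yet) known to the server
    IN_PROGRESS = "inProgress"
    COMPLETE = "complete"
    FAILED = "failed"


@dataclass
class Status:
    state: State
    progress: float = 0.0        # percentage, 0 to 100
    error: Optional[str] = None  # set only when the rebalance failed


class FakeRebalanceService:
    """Hypothetical model of the requested interface (illustration only)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._statuses = {}

    def start_rebalance(self):
        # The trigger call hands back an ID the caller can poll.
        rebalance_id = f"rebalance-{next(self._ids)}"
        self._statuses[rebalance_id] = Status(State.IN_PROGRESS)
        return rebalance_id

    def finish(self, rebalance_id, error=None):
        # The terminal status is retained; a real server would garbage
        # collect it a short while after the rebalance ends.
        if error is None:
            self._statuses[rebalance_id] = Status(State.COMPLETE, 100.0)
        else:
            progress = self._statuses[rebalance_id].progress
            self._statuses[rebalance_id] = Status(State.FAILED, progress, error)

    def status(self, rebalance_id):
        return self._statuses.get(rebalance_id, Status(State.NOT_REGISTERED))


def wait_for(service, rebalance_id, max_polls=100):
    """Poll one specific rebalance until it reaches a terminal state."""
    for _ in range(max_polls):
        status = service.status(rebalance_id)
        if status.state in (State.COMPLETE, State.FAILED):
            return status
    raise TimeoutError(f"{rebalance_id} did not finish in {max_polls} polls")


service = FakeRebalanceService()
rid = service.start_rebalance()
service.finish(rid)  # completes before the caller starts polling
print(wait_for(service, rid).state)
```

      Because the caller polls a specific ID and the terminal status is retained, a rebalance that finishes before the first poll is still observed as complete rather than reported as missing.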

      What could also work is ensuring the task is always visible and hangs around for a bit after the rebalance finishes.

          People

            dfinlay Dave Finlay
            simon.murray Simon Murray