Details

Type: Improvement
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 6.5.0
Component/s: ns_server
Labels: OperatorRequested
Environment: CAO 2.0.0

Description

Scenario:

When the Operator scales down a cluster, we trigger a rebalance and wait for it to complete before deleting the pods, since we want to avoid data loss. Initially we called the rebalance API and then polled the rebalance task.

With 6.5.0-beta2 the task sometimes took a while to appear, so we started polling for the task to appear first in order to handle this behavioral change. (Yes, this is correct behavior; the prior solution was racy!) Without this, the Operator would have declared "no rebalance task, must be done" and prematurely deleted the pods...

With 6.5.0-rc1 the behavior changed again: a rebalance can now run to completion before we even start polling for the task (performance improvement, nice!). In this case we throw an error because we never observed the rebalance. That is fine in the sense that we don't do anything stupid and things still work (although users will quite rightly complain about errors in the logs), but all of QE is predicated upon the determinism and reliability of events. QE needs to see a rebalance start and complete successfully, with the Operator providing these synchronization events.

Possible fix:

When we trigger a rebalance, the server accepts the request and returns an ID (or a full HTTP path to poll; MattI likes this, but in all honesty an ID alone suits our architecture better). A polling path should then tell us whether that specific rebalance has even been registered yet (e.g. started), is in progress (with a percentage), or has completed or failed (an error string would be great for our logs, supportability, and testing). The endpoint need only remain valid for a few moments beyond the end of the rebalance, after which it can be garbage collected. The key point is that when we trigger a rebalance, we need to be able to confirm that it actually happened.
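To make the request concrete, here is a rough client-side sketch of how the Operator could consume such an interface. None of this exists in ns_server today: `RebalanceState`, `RebalanceStatus`, `wait_for_rebalance`, and the `get_status` callable (standing in for an HTTP GET on the hypothetical per-ID polling path) are all made-up names for illustration only.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class RebalanceState(Enum):
    """Hypothetical states the polling path could report for one rebalance ID."""
    UNKNOWN = "unknown"      # ID not registered (yet)
    RUNNING = "running"      # in progress, see progress percentage
    COMPLETED = "completed"
    FAILED = "failed"        # see error string


@dataclass
class RebalanceStatus:
    state: RebalanceState
    progress: float = 0.0          # percentage, meaningful while RUNNING
    error: Optional[str] = None    # populated when FAILED


def wait_for_rebalance(
    get_status: Callable[[str], RebalanceStatus],
    rebalance_id: str,
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> RebalanceStatus:
    """Poll a specific rebalance by ID until it reaches a terminal state.

    get_status is an assumed wrapper around the proposed polling endpoint.
    Because the status is keyed by ID and retained briefly after the
    rebalance ends, a rebalance that finished before the first poll is
    still observed as COMPLETED instead of looking like "no task".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(rebalance_id)
        if status.state in (RebalanceState.COMPLETED, RebalanceState.FAILED):
            return status
        # UNKNOWN (not registered yet) and RUNNING both mean: keep polling.
        sleep(interval)
    raise TimeoutError(f"rebalance {rebalance_id} did not finish within {timeout}s")
```

With something like this the Operator could emit its "rebalance started" and "rebalance completed" events from the returned status alone, regardless of how quickly the rebalance ran.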

Alternatively, it would also work to guarantee that the task is always visible and hangs around for a bit after the rebalance finishes.
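As an illustration of this alternative, the sketch below shows one way a task list could keep finished tasks visible for a grace period. It is a toy model, not how ns_server tracks tasks: `TaskRegistry`, its methods, and the `retention_seconds` parameter are hypothetical names, and the clock is injectable only so the behavior is easy to exercise.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TaskRegistry:
    """Toy task list that retains finished tasks for a short grace period."""

    def __init__(self, retention_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._retention = retention_seconds
        self._clock = clock
        # task_id -> (state, finished_at), finished_at is None while running
        self._tasks: Dict[str, Tuple[str, Optional[float]]] = {}

    def start(self, task_id: str) -> None:
        # Register synchronously so the task is visible as soon as the
        # trigger request has been accepted.
        self._tasks[task_id] = ("running", None)

    def finish(self, task_id: str, state: str = "completed") -> None:
        # Keep the entry and record when it finished instead of deleting it.
        self._tasks[task_id] = (state, self._clock())

    def get(self, task_id: str) -> Optional[str]:
        """Return the task state, or None if unknown or already expired."""
        self._expire()
        entry = self._tasks.get(task_id)
        return entry[0] if entry else None

    def _expire(self) -> None:
        # Drop only tasks that finished more than retention_seconds ago.
        now = self._clock()
        expired = [
            tid for tid, (_, finished_at) in self._tasks.items()
            if finished_at is not None and now - finished_at > self._retention
        ]
        for tid in expired:
            del self._tasks[tid]
```

A poller that arrives late would then still see the task in a finished state, as long as it arrives within the retention window.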

People

Assignee: Dave Finlay

Reporter: Simon Murray

Votes: 0

Watchers: 0

Dates

Created: 06/Dec/19 1:51 AM

Updated: 22/Jan/20 6:50 AM

Need a Better Rebalance Interface
