We need to measure progress better. When vbuckets hold lots of items, we currently display progress as 'stuck' because rebalance progress is measured in terms of vbucket movements.
From the user's perspective, we should be able to track how many items/bytes need to be moved and how many are left. This will likely need some help from the ep-engine folks.
One of the goals of 1.8.1 is better rebalance progress estimation. In particular, we should present the user with an ETA that's not too far off.
We discussed this with Chiyoung today. Here's what we came up with as an initial approach.
ns_server is aware of all the vbucket movements that need to be done. In 1.8.x (maybe later than 1.8.1) we will also build replicas during rebalance, so replica building will be taken into account in the same way as takeovers.
For each needed movement, we will look at the vbucket stats on the source and destination to see whether backfill is needed. If backfill is needed, we know how many items (or bytes, hopefully) will be moved. If backfill is not needed, we will look at checkpoint stats to get the same information. Then, for each in-flight vbucket movement, we already have a stat that tells us how many items are pending in a particular tap cursor. We will combine vbucket movement completion (how many vbucket movements are done out of the total number of movements we need) with this stat (how far the currently in-flight movements are from done) to get the % of completion. From the rate of change of the % of completion we'll then get the ETA.
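To make the arithmetic above concrete, here is a rough sketch in Python (ns_server itself is Erlang, so this only illustrates the calculation and is not proposed code). The `Movement` fields, the state names, and the helper functions are all hypothetical. In particular, `estimated_items` stands in for whatever the backfill or checkpoint stats give us, and `remaining_items` for the pending-items stat of the tap cursor.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Movement:
    """One vbucket movement (takeover or replica build). Hypothetical shape."""
    estimated_items: int       # items expected to move (backfill or checkpoint stats)
    remaining_items: int = 0   # items still pending in the tap cursor (in-flight only)
    state: str = "pending"     # "pending", "in_flight" or "done"


def movement_fraction(m: Movement) -> float:
    """How far a single movement is from done, as a value in [0, 1]."""
    if m.state == "done":
        return 1.0
    if m.state == "pending" or m.estimated_items <= 0:
        return 0.0
    moved = m.estimated_items - m.remaining_items
    # Clamp: ongoing mutations can push the pending count past the estimate.
    return min(1.0, max(0.0, moved / m.estimated_items))


def rebalance_progress(movements: List[Movement]) -> float:
    """Finished movements count fully, in-flight ones partially."""
    if not movements:
        return 1.0
    return sum(movement_fraction(m) for m in movements) / len(movements)


def eta_seconds(prev: float, cur: float, elapsed: float) -> Optional[float]:
    """ETA from the rate of progress change; None if progress did not advance."""
    if elapsed <= 0 or cur <= prev:
        return None
    return (1.0 - cur) * elapsed / (cur - prev)


movements = [
    Movement(estimated_items=100, state="done"),
    Movement(estimated_items=200, remaining_items=50, state="in_flight"),
    Movement(estimated_items=100),
    Movement(estimated_items=100),
]
progress = rebalance_progress(movements)                    # (1.0 + 0.75) / 4
eta = eta_seconds(prev=0.25, cur=progress, elapsed=30.0)
```

This sketch weights every movement equally, which matches the "movements done out of the total count" wording. Weighting each movement by its estimated items (or bytes) would be an alternative if vbucket sizes turn out to vary a lot.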
We know this doesn't directly take into account ongoing mutations or temporary OOM NAKs, but hopefully it still won't be too far off. I think that because we're going to refresh our estimates periodically by looking at vbucket stats for the vbuckets whose movement is still pending, it will in practice account for ongoing mutations and should work well enough.